
Deep Learning

Ian Goodfellow
Yoshua Bengio
Aaron Courville

Contents

Website
Acknowledgments
Notation

1  Introduction
   1.1  Who Should Read This Book?
   1.2  Historical Trends in Deep Learning

I  Applied Math and Machine Learning Basics

2  Linear Algebra
   2.1  Scalars, Vectors, Matrices and Tensors
   2.2  Multiplying Matrices and Vectors
   2.3  Identity and Inverse Matrices
   2.4  Linear Dependence and Span
   2.5  Norms
   2.6  Special Kinds of Matrices and Vectors
   2.7  Eigendecomposition
   2.8  Singular Value Decomposition
   2.9  The Moore-Penrose Pseudoinverse
   2.10 The Trace Operator
   2.11 The Determinant
   2.12 Example: Principal Components Analysis

3  Probability and Information Theory
   3.1  Why Probability?
   3.2  Random Variables
   3.3  Probability Distributions
   3.4  Marginal Probability
   3.5  Conditional Probability
   3.6  The Chain Rule of Conditional Probabilities
   3.7  Independence and Conditional Independence
   3.8  Expectation, Variance and Covariance
   3.9  Common Probability Distributions
   3.10 Useful Properties of Common Functions
   3.11 Bayes' Rule
   3.12 Technical Details of Continuous Variables
   3.13 Information Theory
   3.14 Structured Probabilistic Models

4  Numerical Computation
   4.1  Overflow and Underflow
   4.2  Poor Conditioning
   4.3  Gradient-Based Optimization
   4.4  Constrained Optimization
   4.5  Example: Linear Least Squares

5  Machine Learning Basics
   5.1  Learning Algorithms
   5.2  Capacity, Overfitting and Underfitting
   5.3  Hyperparameters and Validation Sets
   5.4  Estimators, Bias and Variance
   5.5  Maximum Likelihood Estimation
   5.6  Bayesian Statistics
   5.7  Supervised Learning Algorithms
   5.8  Unsupervised Learning Algorithms
   5.9  Stochastic Gradient Descent
   5.10 Building a Machine Learning Algorithm
   5.11 Challenges Motivating Deep Learning

II  Deep Networks: Modern Practices

6  Deep Feedforward Networks
   6.1  Example: Learning XOR
   6.2  Gradient-Based Learning
   6.3  Hidden Units
   6.4  Architecture Design
   6.5  Back-Propagation and Other Differentiation Algorithms
   6.6  Historical Notes

7  Regularization for Deep Learning
   7.1  Parameter Norm Penalties
   7.2  Norm Penalties as Constrained Optimization
   7.3  Regularization and Under-Constrained Problems
   7.4  Dataset Augmentation
   7.5  Noise Robustness
   7.6  Semi-Supervised Learning
   7.7  Multi-Task Learning
   7.8  Early Stopping
   7.9  Parameter Tying and Parameter Sharing
   7.10 Sparse Representations
   7.11 Bagging and Other Ensemble Methods
   7.12 Dropout
   7.13 Adversarial Training
   7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

8  Optimization for Training Deep Models
   8.1  How Learning Differs from Pure Optimization
   8.2  Challenges in Neural Network Optimization
   8.3  Basic Algorithms
   8.4  Parameter Initialization Strategies
   8.5  Algorithms with Adaptive Learning Rates
   8.6  Approximate Second-Order Methods
   8.7  Optimization Strategies and Meta-Algorithms

9  Convolutional Networks
   9.1  The Convolution Operation
   9.2  Motivation
   9.3  Pooling
   9.4  Convolution and Pooling as an Infinitely Strong Prior
   9.5  Variants of the Basic Convolution Function
   9.6  Structured Outputs
   9.7  Data Types
   9.8  Efficient Convolution Algorithms
   9.9  Random or Unsupervised Features
   9.10 The Neuroscientific Basis for Convolutional Networks
   9.11 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets
   10.1  Unfolding Computational Graphs
   10.2  Recurrent Neural Networks
   10.3  Bidirectional RNNs
   10.4  Encoder-Decoder Sequence-to-Sequence Architectures
   10.5  Deep Recurrent Networks
   10.6  Recursive Neural Networks
   10.7  The Challenge of Long-Term Dependencies
   10.8  Echo State Networks
   10.9  Leaky Units and Other Strategies for Multiple Time Scales
   10.10 The Long Short-Term Memory and Other Gated RNNs
   10.11 Optimization for Long-Term Dependencies
   10.12 Explicit Memory

11 Practical Methodology
   11.1  Performance Metrics
   11.2  Default Baseline Models
   11.3  Determining Whether to Gather More Data
   11.4  Selecting Hyperparameters
   11.5  Debugging Strategies
   11.6  Example: Multi-Digit Number Recognition

12 Applications
   12.1  Large Scale Deep Learning
   12.2  Computer Vision
   12.3  Speech Recognition
   12.4  Natural Language Processing
   12.5  Other Applications

III  Deep Learning Research

13 Linear Factor Models
   13.1  Probabilistic PCA and Factor Analysis
   13.2  Independent Component Analysis (ICA)
   13.3  Slow Feature Analysis
   13.4  Sparse Coding
   13.5  Manifold Interpretation of PCA

14 Autoencoders
   14.1  Undercomplete Autoencoders
   14.2  Regularized Autoencoders
   14.3  Representational Power, Layer Size and Depth
   14.4  Stochastic Encoders and Decoders
   14.5  Denoising Autoencoders
   14.6  Learning Manifolds with Autoencoders
   14.7  Contractive Autoencoders
   14.8  Predictive Sparse Decomposition
   14.9  Applications of Autoencoders

15 Representation Learning
   15.1  Greedy Layer-Wise Unsupervised Pretraining
   15.2  Transfer Learning and Domain Adaptation
   15.3  Semi-Supervised Disentangling of Causal Factors
   15.4  Distributed Representation
   15.5  Exponential Gains from Depth
   15.6  Providing Clues to Discover Underlying Causes

16 Structured Probabilistic Models for Deep Learning
   16.1  The Challenge of Unstructured Modeling
   16.2  Using Graphs to Describe Model Structure
   16.3  Sampling from Graphical Models
   16.4  Advantages of Structured Modeling
   16.5  Learning about Dependencies
   16.6  Inference and Approximate Inference
   16.7  The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods
   17.1  Sampling and Monte Carlo Methods
   17.2  Importance Sampling
   17.3  Markov Chain Monte Carlo Methods
   17.4  Gibbs Sampling
   17.5  The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function
   18.1  The Log-Likelihood Gradient
   18.2  Stochastic Maximum Likelihood and Contrastive Divergence
   18.3  Pseudolikelihood
   18.4  Score Matching and Ratio Matching
   18.5  Denoising Score Matching
   18.6  Noise-Contrastive Estimation
   18.7  Estimating the Partition Function

19 Approximate Inference
   19.1  Inference as Optimization
   19.2  Expectation Maximization
   19.3  MAP Inference and Sparse Coding
   19.4  Variational Inference and Learning
   19.5  Learned Approximate Inference

20 Deep Generative Models
   20.1  Boltzmann Machines
   20.2  Restricted Boltzmann Machines
   20.3  Deep Belief Networks
   20.4  Deep Boltzmann Machines
   20.5  Boltzmann Machines for Real-Valued Data
   20.6  Convolutional Boltzmann Machines
   20.7  Boltzmann Machines for Structured or Sequential Outputs
   20.8  Other Boltzmann Machines
   20.9  Back-Propagation through Random Operations
   20.10 Directed Generative Nets
   20.11 Drawing Samples from Autoencoders
   20.12 Generative Stochastic Networks
   20.13 Other Generation Schemes
   20.14 Evaluating Generative Models
   20.15 Conclusion

Bibliography

Index

Website

www.deeplearningbook.org

This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.

Acknowledgments

This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Kari Pulli, Tapani Raiko, Anurag Ranjan, Johannes Roith, Halis Sak, César Salgado, Grigory Sapunov, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:

• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.

• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett, Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Colby Toland, Massimiliano Tomassoli, Alessandro Vitale and Bob Welland.

• Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Andre Simpelo, Alexey Surkov and Volker Tresp.

• Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer, and Hu Yuhuang.

• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Kee-Bong Song, Zheng Sun and Andy Wu.

• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.

• Chapter 7, Regularization for Deep Learning: Inkyu Lee, Sunil Mohan and Joshua Salisbury.

• Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens and Klaus Strobl.

• Chapter 9, Convolutional Networks: Martín Arjovsky, Eugene Brevdo, Eric Jensen, Asifullah Khan, Mehdi Mirza, Alex Paino, Eddie Pierce, Marjorie Sayer, Ryan Stout and Wentao Wu.

• Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gökçen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Mihaela Rosca, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang.

• Chapter 11, Practical Methodology: Daniel Beckstein.

• Chapter 12, Applications: George Dahl and Ribana Roscher.

• Chapter 15, Representation Learning: Kunal Ghosh.

• Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê and Anton Varfolom.

• Chapter 18, Confronting the Partition Function: Sam Bowman.

• Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.

• Bibliography: Leslie N. Smith.

We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text.

We would like to thank Ian's wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book as well as for help with proofreading.

We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian's former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.

Notation

This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, this notation reference may seem intimidating. However, do not despair, we describe most of these ideas in chapters 2-4.

Numbers and Arrays

$a$                               A scalar (integer or real)
$\boldsymbol{a}$                  A vector
$\boldsymbol{A}$                  A matrix
$\mathsf{A}$                      A tensor
$\boldsymbol{I}_n$                Identity matrix with $n$ rows and $n$ columns
$\boldsymbol{I}$                  Identity matrix with dimensionality implied by context
$\boldsymbol{e}^{(i)}$            Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$
$\mathrm{diag}(\boldsymbol{a})$   A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$
$\mathrm{a}$                      A scalar random variable
$\mathbf{a}$                      A vector-valued random variable
$\mathbf{A}$                      A matrix-valued random variable

Sets and Graphs

$\mathbb{A}$                        A set
$\mathbb{R}$                        The set of real numbers
$\{0, 1\}$                          The set containing 0 and 1
$\{0, 1, \dots, n\}$                The set of all integers between 0 and $n$
$[a, b]$                            The real interval including $a$ and $b$
$(a, b]$                            The real interval excluding $a$ but including $b$
$\mathbb{A} \setminus \mathbb{B}$   Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
$\mathcal{G}$                       A graph
$Pa_{\mathcal{G}}(x_i)$             The parents of $x_i$ in $\mathcal{G}$

Indexing

$a_i$                  Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1
$a_{-i}$               All elements of vector $\boldsymbol{a}$ except for element $i$
$A_{i,j}$              Element $i, j$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{i,:}$ Row $i$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{:,i}$ Column $i$ of matrix $\boldsymbol{A}$
$\mathsf{A}_{i,j,k}$   Element $(i, j, k)$ of a 3-D tensor $\mathsf{A}$
$\mathsf{A}_{:,:,i}$   2-D slice of a 3-D tensor
$\mathrm{a}_i$         Element $i$ of the random vector $\mathbf{a}$

Linear Algebra Operations

$\boldsymbol{A}^\top$                 Transpose of matrix $\boldsymbol{A}$
$\boldsymbol{A}^+$                    Moore-Penrose pseudoinverse of $\boldsymbol{A}$
$\boldsymbol{A} \odot \boldsymbol{B}$ Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$
$\det(\boldsymbol{A})$                Determinant of $\boldsymbol{A}$

Calculus

$\frac{dy}{dx}$                   Derivative of $y$ with respect to $x$
$\frac{\partial y}{\partial x}$   Partial derivative of $y$ with respect to $x$
$\nabla_{\boldsymbol{x}} y$       Gradient of $y$ with respect to $\boldsymbol{x}$
$\nabla_{\boldsymbol{X}} y$       Matrix derivatives of $y$ with respect to $\boldsymbol{X}$
$\nabla_{\mathsf{X}} y$           Tensor containing derivatives of $y$ with respect to $\mathsf{X}$
$\frac{\partial f}{\partial \boldsymbol{x}}$   Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f : \mathbb{R}^n \to \mathbb{R}^m$
$\nabla_{\boldsymbol{x}}^2 f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$   The Hessian matrix of $f$ at input point $\boldsymbol{x}$
$\int f(\boldsymbol{x})\,d\boldsymbol{x}$                  Definite integral over the entire domain of $\boldsymbol{x}$
$\int_{\mathbb{S}} f(\boldsymbol{x})\,d\boldsymbol{x}$     Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$

Probability and Information Theory

$\mathrm{a} \perp \mathrm{b}$              The random variables $\mathrm{a}$ and $\mathrm{b}$ are independent
$\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$   They are conditionally independent given $\mathrm{c}$
$P(\mathrm{a})$                            A probability distribution over a discrete variable
$p(\mathrm{a})$                            A probability distribution over a continuous variable, or over a variable whose type has not been specified
$\mathrm{a} \sim P$                        Random variable $\mathrm{a}$ has distribution $P$
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$   Expectation of $f(x)$ with respect to $P(\mathrm{x})$
$\mathrm{Var}(f(x))$                       Variance of $f(x)$ under $P(\mathrm{x})$
$\mathrm{Cov}(f(x), g(x))$                 Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$
$H(\mathrm{x})$                            Shannon entropy of the random variable $\mathrm{x}$
$D_{\mathrm{KL}}(P \| Q)$                  Kullback-Leibler divergence of $P$ and $Q$
$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$   Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$

Functions

$f : \mathbb{A} \to \mathbb{B}$   The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$
$f \circ g$                       Composition of the functions $f$ and $g$
$f(\boldsymbol{x}; \boldsymbol{\theta})$   A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$. Sometimes we just write $f(\boldsymbol{x})$ and ignore the argument $\boldsymbol{\theta}$ to lighten notation.
$\log x$                          Natural logarithm of $x$
$\sigma(x)$                       Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$
$\zeta(x)$                        Softplus, $\log(1 + \exp(x))$
$\|\boldsymbol{x}\|_p$            $L^p$ norm of $\boldsymbol{x}$
$\|\boldsymbol{x}\|$              $L^2$ norm of $\boldsymbol{x}$
$x^+$                             Positive part of $x$, i.e., $\max(0, x)$
$\mathbf{1}_{\text{condition}}$   is 1 if the condition is true, 0 otherwise

Sometimes we use a function $f$ whose argument is a scalar, but apply it to a vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\mathsf{X})$. This means to apply $f$ to the array element-wise. For example, if $\mathsf{C} = \sigma(\mathsf{X})$, then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values of $i$, $j$ and $k$.
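As a minimal NumPy sketch of this convention (the function and arrays here are illustrative, not part of the book's notation), the logistic sigmoid defined above can be applied to a 3-D tensor and acts on each entry independently:

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + exp(-x)). Defined for a scalar,
    # but NumPy evaluates np.exp element-wise, so the same code
    # works for vectors, matrices and tensors.
    return 1.0 / (1.0 + np.exp(-x))

X = np.random.randn(2, 3, 4)   # a 3-D tensor
C = sigmoid(X)                 # element-wise application

# Each entry satisfies C[i, j, k] = sigmoid(X[i, j, k]).
assert np.isclose(C[1, 2, 3], sigmoid(X[1, 2, 3]))
```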
Datasets and Distributions

$p_{\text{data}}$           The data generating distribution
$\hat{p}_{\text{data}}$     The empirical distribution defined by the training set
$\mathbb{X}$                A set of training examples
$\boldsymbol{x}^{(i)}$      The $i$-th example (input) from a dataset
$y^{(i)}$ or $\boldsymbol{y}^{(i)}$   The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning
$\boldsymbol{X}$            The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$
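A small illustration of the design matrix convention (the values are made up; note that the code indexes rows from 0 while the notation starts at 1):

```python
import numpy as np

# A design matrix X with m = 3 examples and n = 2 features:
# example x^(i) occupies row i of X.
X = np.array([[1.0, 0.5],    # x^(1)
              [2.0, 1.5],    # x^(2)
              [0.0, 3.0]])   # x^(3)

m, n = X.shape    # m examples, n features per example
x1 = X[0, :]      # row X_{1,:} holds the first example x^(1)
```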

Chapter 1
Chapter 1
In
Intro
tro
troduction
duction
In
In
Inv
tro
ven
entors
duction
tors ha
hav
ve long dreamed of creating mac
machines
hines that think. This desire dates
bac
backk to at least the time of ancien ancientt Greece. The mythical figures Pygmalion,
Inventors ha
Daedalus, andve long dreamedma
Hephaestus ofy creating
may machines that
all be interpreted think. This
as legendary in
invvdesire
en tors,dates
entors, and
bac k to at least the time of ancien t Greece. The m ythical
Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, figures Pygmalion,
Daedalus,
2004 ; Spark
Sparkesand Hephaestus
es, 1996 ; Tandy, ma 1997y ).
all be interpreted as legendary inventors, and
Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin,
2004When
; Spark programmable
es, 1996; Tandy computers
, 1997). were first conceiv conceived,
ed, people wondered whether
they migh
mightt become intelligen
intelligent,t, ov
over
er a hundred years before one was built (Lo Lov
velace
elace,,
1842When
1842).). Todayprogrammable
oday,, artificial intel computers
intelligenc
ligenc w ere first conceiv ed, p
ligencee (AI) is a thriving field with man eople w ondered
many whether
y practical
they migh t b ecome intelligen
applications and active researc t,
research over a hundred
h topics. We lo look years b efore
ok to intelligen one was
intelligentt softw are to (automate
softwarebuilt Lovelace,
1842 ). Tlab
routine odayor,, artificial
labor, understand intel ligenc
sp eech eor(AI)
speech is a thriving
images, mak field withinman
makee diagnoses y practical
medicine and
applications
supp
support and active researc
ort basic scientific research. h topics. W e lo ok to intelligen t softw are to automate
routine labor, understand speech or images, make diagnoses in medicine and
suppInortthebasic
earlyscientific
da
days
ys of research.
artificial in intelligence,
telligence, the field rapidly tackled and solved
problems that are intellectually difficult for human beings but relativ relativelyely straight-
forwIn
forward
ard the
forearly da ys of artificial
computers—problems in telligence,
that can b e the field
describ
described
edrapidly
by a tackled
list of and solved
formal, math-
problems that are intellectually difficult for
ematical rules. The true challenge to artificial intelligence provhuman b eings but relativ
proved ely straight-
ed to be solving
forw ard for computers—problems that can b
the tasks that are easy for people to perform but hard for people e describ ed by a list of formal, math-
to describ
describe e
ematical rules. The true challenge
formally—problems that we solve intuitiv to artificial
intuitivelyely intelligence prov ed to
ely,, that feel automatic, like recognizingb e solving
the
sp
spok
oktasks
oken that are easy
en words or faces in images. for p eople to p erform but hard for people to describe
formally—problems that we solve intuitively, that feel automatic, like recognizing
This book is ab about
out a solution to these more intuitiv intuitivee problems. This solution is
spoken words or faces in images.
to allow computers to learn from exp experience
erience and understand the world in terms of a
This
hierarc
hierarch b o ok is ab out a solution to
hy of concepts, with each concept defined these more intuitiv e problems.
in terms This solution
of its relation to simpleris
to allow computers
concepts. By gathering to learn from exp
knowledge erience
from and understand
experience, the world
this approac
approach h av in terms
avoids
oids of a
the need
hierarc
for humanhy ofop concepts,
operators
erators to with each concept
formally sp ecify defined
specify all of the in terms
knowledgeof its that
relation
the to simpler
computer
concepts.
needs. TheBy gathering
hierarc
hierarch knowledge
hy of concepts fromthe
allows experience,
computerthis approac
to learn h avoids concepts
complicated the need
for h uman op erators to formally sp ecify all of the
by building them out of simpler ones. If we draw a graph showing how theseknowledge that the computer
needs. The hierarchy of concepts allows the computer to learn complicated concepts
by building them out of simpler ones. 1If we draw a graph showing how these
1
CHAPTER 1. INTRODUCTION

concepts are built on top of eac eachh other, the graph is deep, with man many y lay
layers.
ers. For
this reason, we call this approach to AI de deep
ep lelearning
arning
arning..
concepts are built on top of each other, the graph is deep, with many layers. For
Man
Many y of the early successes of AI took place in relativ relativelyely sterile and formal
this reason, we call this approach to AI deep learning.
en
environmen
vironmen
vironments ts and did not require computers to ha havve muc uchh kno
knowledge
wledge ab about
out
Man
the world. F y of the
For early successes of AI
or example, IBM’s Deep Blue chess-pla took place in
chess-playing relativ ely sterile and
ying system defeated world formal
cen vironmen
hampion ts and
Garry did not
Kasparo
Kasparov v inrequire
1997 (Hsu computers
, 2002). to haveismofuccourse
Chess h knowledge about
a very simple
wthe world.
orld, con For example,
containing
taining only sixt IBM’s
y-fourDeep
sixty-four lo Blue and
locations
cations chess-plathirt ying
thirty-t
y-t
y-tw wosystem
pieces defeated
that can w orlde
mov
move
champion
in only rigidlyGarrycircumscrib
Kasparov in
circumscribed ed 1997
ways.(Hsu , 2002). aChess
Devising is of course
successful chess astrategy
very simple is a
w orld, con taining
tremendous accomplishmen only
accomplishment, sixt y-four
t, but the challenge is not due to the difficulty ofe
lo cations and thirt y-t w o pieces that can mov
in only rigidly
describing the setcircumscrib ed ways.
of chess pieces and Devising
allo
allowwableamov successful
moves es to thechess strategyChess
computer. is a
tremendous
can be completelyaccomplishmen
describ
described ed t,bybut the brief
a very challenge
list ofiscompletely
not due to the difficulty
formal rules, easily of
describing
pro
provided
vided ahead the set of chess
of time by thepieces and allowable moves to the computer. Chess
programmer.
can be completely described by a very brief list of completely formal rules, easily
Ironically
Ironically,, abstract and formal tasks that are among the most difficult mental
provided ahead of time by the programmer.
undertakings for a human being are among the easiest for a computer. Computers
ha
havveIronically
long been , abstract
able toand formal
defeat even tasks
thethat
bestare humanamong the most
chess play
player,difficult
er, but are mental
only
undertakings
recen
recently for a h uman b eing are among the easiest
tly matching some of the abilities of average human beings to recognize ob for a computer. Computers
objects
jects
ha v
or sp e long
speech. b een able
eech. A person’s everyda to defeat
everyday even the b est human chess
y life requires an immense amount of kno play er, but are
knowledge only
wledge
recen
ab outtly
about thematching
world. Muchsome of of the
thisabilities
kno
knowledge
wledge of aviserage
sub human
subjectiv
jectiv
jective beings
e and to recognize
intuitiv
intuitive, ob jects
e, and therefore
or speech.
difficult to A person’s in
articulate everyda
a formal y lifewarequires
y. Computersan immense need amount
to capture of kno
thiswledge
same
ab
kno out
knowledge the world. Much
wledge in order to beha of this
ehav kno wledge
ve in an in is
intelligen sub
telligen jectiv e and intuitiv e,
telligentt way. One of the key challenges inand therefore
difficult
artificial in to articulate
intelligence in a formal w
telligence is how to get this informal a y. Computers kno need into
knowledge
wledge to capture this same
a computer.
knowledge in order to behave in an intelligent way. One of the key challenges in
Sev
Several
eral artificial in intelligence
telligence pro projects
jects hav
havee sought to hard-cohard-code de knowledge ab about
out
artificial intelligence is how to get this informal knowledge into a computer.
Several artificial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail.
The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression were given an MRI scan of the patient, rather than the doctor's formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery.
This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms. For a simple visual example, see Fig. 1.1.
Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.
However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.
[Figure 1.1: two scatterplots of the same data, one in Cartesian coordinates (axes x and y), one in polar coordinates (axes r and θ).]

Figure 1.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. (Figure produced in collaboration with David Warde-Farley)
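To make the change of representation concrete, here is a minimal Python sketch in the spirit of Fig. 1.1 (the two ring-shaped categories and the r > 1.5 threshold are invented for illustration): data that no straight line separates in Cartesian coordinates becomes separable by a single vertical line in polar coordinates.

import numpy as np

# Two ring-shaped categories: no straight line separates them in (x, y).
rng = np.random.default_rng(0)
n = 200
radius = np.concatenate([rng.uniform(0.0, 1.0, n),   # category 0: inner disk
                         rng.uniform(2.0, 3.0, n)])  # category 1: outer ring
angle = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
x, y = radius * np.cos(angle), radius * np.sin(angle)
labels = np.concatenate([np.zeros(n), np.ones(n)])

# In polar coordinates the radius alone solves the task: the vertical line
# r = 1.5 classifies every point correctly.
r = np.sqrt(x**2 + y**2)
predictions = (r > 1.5).astype(float)
print("accuracy with the polar representation:", (predictions == labels).mean())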
One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers.
The quintessential example of a representation learning algorithm is the autoencoder. An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties.
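The encoder/decoder structure is simple enough to write down directly. Below is a minimal sketch (NumPy, with random untrained weights and arbitrarily chosen sizes); it shows only the plumbing, since training would adjust the weights so that the reconstruction stays close to the input while the code acquires whatever properties the particular kind of autoencoder aims for.

import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(2, 5))  # encoder weights: 5-d input -> 2-d code
W_dec = rng.normal(size=(5, 2))  # decoder weights: 2-d code -> 5-d output

def encoder(x):
    # Converts the input data into a different representation.
    return W_enc @ x

def decoder(h):
    # Converts the new representation back into the original format.
    return W_dec @ h

x = rng.normal(size=5)
reconstruction = decoder(encoder(x))
# Training would minimize a loss comparing reconstruction with x.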
When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. In this context, we use the word “factors” simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not quantities that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker's age, their sex, their accent and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.
Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker's accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.

Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.

The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later instructions can refer back to the results of earlier instructions.
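In code, this view of an MLP as composition is almost trivial to state. The sketch below (random placeholder weights, arbitrary layer sizes) shows a two-layer perceptron as nothing more than composed functions, where the value after each layer is a new representation of the input.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first simple function: 3-d input -> 4-d
W2 = rng.normal(size=(2, 4))  # second simple function: 4-d -> 2-d output

def layer1(x):
    return np.maximum(0.0, W1 @ x)  # one representation of the input

def layer2(h):
    return W2 @ h                   # a further representation

x = rng.normal(size=3)
output = layer2(layer1(x))  # the MLP is just the composition of its layers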
[Figure 1.2: a stack of layers from input pixels to object identity. Visible layer (input pixels) → 1st hidden layer (edges) → 2nd hidden layer (corners and contours) → 3rd hidden layer (object parts) → output (object identity: CAR, PERSON, ANIMAL).]

Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer's description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).
[Figure 1.3: two computational graphs for the same logistic regression computation. In the left graph the element set is {×, +, logistic sigmoid}, applied to w1x1 + w2x2, giving depth three; in the right graph the element set treats logistic regression as a single element, giving depth one.]

Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output but depends on the definition of what constitutes a possible computational step. The computation depicted in these graphs is the output of a logistic regression model, σ(wᵀx), where σ is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the elements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one.
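The two depth measurements of Fig. 1.3 can be reproduced by writing the same computation at two granularities, as in the sketch below (the weights and inputs are invented for illustration).

import math

w = [0.5, -1.0]  # invented weights
x = [2.0, 1.0]   # invented inputs

# Granularity 1: the elements are {*, +, sigmoid}. The longest path runs
# through a multiplication, then the additions, then the sigmoid: depth three.
products = [wi * xi for wi, xi in zip(w, x)]
s = sum(products)
output = 1.0 / (1.0 + math.exp(-s))

# Granularity 2: logistic regression is itself an element: depth one.
def logistic_regression(x, w):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

assert abs(output - logistic_regression(x, w)) < 1e-12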
According to this view of deep learning, not all of the information in a layer's activations necessarily encodes factors of variation that explain the input. The representation also stores state information that helps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input specifically, but it helps the model to organize its processing.

There are two main ways of measuring the depth of a model. The first view is based on the number of sequential instructions that must be executed to evaluate the architecture. We can think of this as the length of the longest path through a flow chart that describes how to compute each of the model's outputs given its inputs. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. Fig. 1.3 illustrates how this choice of language can give two different measurements for the same architecture.

Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth of the flowchart of the computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves. This is because the system's understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers—a layer for eyes and a layer for faces—but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times.

Because it is not always clear which of these two views—the depth of the computational graph, or the depth of the probabilistic modeling graph—is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.” However, deep learning can safely be regarded as the study of models that involve a greater amount of composition of either learned functions or learned concepts than traditional machine learning does.

To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Fig. 1.4 illustrates the relationship between these different AI disciplines. Fig. 1.5 gives a high-level schematic of how each works.
1.1 Who Should Read This Book?

This book can be useful for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in many software disciplines including computer vision, speech and audio processing, natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance.
[Figure 1.4: a Venn diagram with AI as the outermost set (example: knowledge bases), containing machine learning (example: logistic regression), containing representation learning (example: shallow autoencoders), containing deep learning (example: MLPs).]

Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.
[Figure 1.5: four column flowcharts from input to output. Rule-based systems: input → hand-designed program → output. Classic machine learning: input → hand-designed features → mapping from features → output. Representation learning: input → features → mapping from features → output. Deep learning: input → simple features → additional layers of more abstract features → mapping from features → output.]

Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines. Shaded boxes indicate components that are able to learn from data.
This book has been organized into three parts in order to best accommodate a variety of readers. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning.

Readers should feel free to skip parts that are not relevant given their interests or background. Readers familiar with linear algebra, probability, and fundamental machine learning concepts can skip Part I, for example, while readers who just want to implement a working system need not read beyond Part II. To help choose which chapters to read, Fig. 1.6 provides a flowchart showing the high-level organization of the book.

We do assume that all readers come from a computer science background. We assume familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory level calculus and some of the terminology of graph theory.
1.2 Historical Trends in Deep Learning

It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has increased.

• Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved.

• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
[Figure 1.6: dependency flowchart of the book's chapters. 1. Introduction; Part I: Applied Math and Machine Learning Basics (2. Linear Algebra, 3. Probability and Information Theory, 4. Numerical Computation, 5. Machine Learning Basics); Part II: Deep Networks: Modern Practices (6. Deep Feedforward Networks, 7. Regularization, 8. Optimization, 9. CNNs, 10. RNNs, 11. Practical Methodology, 12. Applications); Part III: Deep Learning Research (13. Linear Factor Models, 14. Autoencoders, 15. Representation Learning, 16. Structured Probabilistic Models, 17. Monte Carlo Methods, 18. Partition Function, 19. Inference, 20. Deep Generative Models).]

Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.
1.2.1 The Many Names and Changing Fortunes of Neural Networks

We expect that many readers of this book have heard of deep learning as an exciting new technology, and are surprised to see a mention of “history” in a book about an emerging field. In fact, deep learning dates back to the 1940s. Deep learning only appears to be new, because it was relatively unpopular for several years preceding its current popularity, and because it has gone through many different names, and has only recently become called “deep learning.” The field has been rebranded many times, reflecting the influence of different researchers and different perspectives.

A comprehensive history of deep learning is beyond the scope of this textbook. However, some basic context is useful for understanding deep learning. Broadly speaking, there have been three waves of development of deep learning: deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006. This is quantitatively illustrated in Fig. 1.7.

Some of the earliest learning algorithms we recognize today were intended to be computational models of biological learning, i.e. models of how learning happens or could happen in the brain. As a result, one of the names that deep learning has gone by is artificial neural networks (ANNs). The corresponding perspective on deep learning models is that they are engineered systems inspired by the biological brain (whether the human brain or the brain of another animal). While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function. The neural perspective on deep learning is motivated by two main ideas. One idea is that the brain provides a proof by example that intelligent behavior is possible, and a conceptually straightforward path to building intelligence is to reverse engineer the computational principles behind the brain and duplicate its functionality. Another perspective is that it would be deeply interesting to understand the brain and the principles that underlie human intelligence, so machine learning models that shed light on these basic scientific questions are useful apart from their ability to solve engineering applications.

The modern term “deep learning” goes beyond the neuroscientific perspective on the current breed of machine learning models. It appeals to a more general principle of learning multiple levels of composition, which can be applied in machine learning frameworks that are not necessarily neurally inspired.
[Figure 1.7: a line plot of Frequency of Word or Phrase (0.000000 to 0.000250) versus Year (1940 to 2000), with one curve for “cybernetics” and one for “(connectionism + neural networks)”.]

Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases “cybernetics” and “connectionism” or “neural networks” according to Google Books (the third wave is too recent to appear). The first wave started with cybernetics in the 1940s–1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is just now appearing in book form as of 2016. The other two waves similarly appeared in book form much later than the corresponding scientific activity occurred.
The earliest predecessors of modern deep learning were simple linear models motivated from a neuroscientific perspective. These models were designed to take a set of n input values x1, . . . , xn and associate them with an output y. These models would learn a set of weights w1, . . . , wn and compute their output f(x, w) = x1w1 + · · · + xnwn. This first wave of neural networks research was known as cybernetics, as illustrated in Fig. 1.7.
The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of brain function. This linear model could recognize two different categories of inputs by testing whether f(x, w) is positive or negative. Of course, for the model to correspond to the desired definition of the categories, the weights needed to be set correctly. These weights could be set by the human operator. In the 1950s, the perceptron (Rosenblatt, 1958, 1962) became the first model that could learn the weights defining the categories given examples of inputs from each category. The adaptive linear element (ADALINE), which dates from about the same time, simply returned the value of f(x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to predict these numbers from data.
These simple learning algorithms greatly affected the modern landscape of machine learning. The training algorithm used to adapt the weights of the ADALINE was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today.
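As a sketch of that idea (synthetic data and an arbitrary learning rate, not the historical implementation), the following trains an ADALINE-style model f(x) = w1x1 + w2x2 by stochastic gradient descent, taking a small step against the gradient of the squared error of one example at a time.

import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])                # weights to be recovered
X = rng.normal(size=(100, 2))                 # synthetic inputs
y = X @ true_w + 0.01 * rng.normal(size=100)  # noisy real-valued targets

w = np.zeros(2)
learning_rate = 0.1
for epoch in range(20):
    for xi, yi in zip(X, y):
        error = xi @ w - yi              # f(x) - y for a single example
        w -= learning_rate * error * xi  # gradient step on (1/2) * error**2
print(w)  # close to [2.0, -1.0]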
Models based on the f(x, w) used by the perceptron and ADALINE are called linear models. These models remain some of the most widely used machine learning models, though in many cases they are trained in different ways than the original models were trained.

Linear models have many limitations. Most famously, they cannot learn the XOR function, where f([0, 1], w) = 1 and f([1, 0], w) = 1 but f([1, 1], w) = 0 and f([0, 0], w) = 0. Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.
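The XOR failure is easy to verify directly. In the small sketch below, the 0.5 threshold and the particular weights are arbitrary choices for illustration; every other choice of weights fails on at least one of the four inputs as well.

import itertools

def linear_model(x, w):
    # f(x, w) = x1*w1 + x2*w2
    return sum(xi * wi for xi, wi in zip(x, w))

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
w = (1.0, 1.0)
for x in itertools.product([0, 1], repeat=2):
    prediction = 1 if linear_model(x, w) > 0.5 else 0
    print(x, "target:", xor[x], "prediction:", prediction)  # wrong on (1, 1)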
Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field.

The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide. To obtain a deep understanding of the actual algorithms used by the brain, we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously. Because we are not able to do this, we are far from understanding even some of the most simple and well-studied parts of the brain (Olshausen and Field, 2005).
Neuroscience has given us a reason to hop hopee that a single deep learning algorithm
well-studied parts of the brain (Olshausen and Field, 2005).
Neuroscience has given us a reason to hope that a single deep learning
algorithm can solve many different tasks. Neuroscientists have found that
ferrets can learn to “see” with the auditory processing region of their brain
if their brains are rewired to send visual signals to that area (Von Melchner
et al., 2000). This suggests that much of the mammalian brain might use a
single algorithm to solve most of the different tasks that the brain solves.
Before this hypothesis, machine learning research was more fragmented, with
different communities of researchers studying natural language processing,
vision, motion planning and speech recognition. Today, these application
communities are still separate, but it is common for deep learning research
groups to study many or even all of these application areas simultaneously.
We are able to draw some rough guidelines from neuroscience. The basic idea of
having many computational units that become intelligent only via their
interactions with each other is inspired by the brain. The Neocognitron
(Fukushima, 1980) introduced a powerful model architecture for processing
images that was inspired by the structure of the mammalian visual system and
later became the basis for the modern convolutional network (LeCun et al.,
1998b), as we will see in Sec. 9.10. Most neural networks today are based on a
model neuron called the rectified linear unit. The original Cognitron
(Fukushima, 1975) introduced a more complicated version that was highly
inspired by our knowledge of brain function. The simplified modern version was
developed incorporating ideas from many viewpoints, with Nair and Hinton (2010)
and Glorot et al. (2011a) citing neuroscience as an influence, and Jarrett
et al. (2009) citing more engineering-oriented influences. While neuroscience
is an important source of inspiration, it need not be taken as a rigid guide.
We know that actual neurons compute very different functions than modern
rectified linear units, but greater neural realism has not yet led to an
improvement in machine learning performance. Also, while neuroscience has
successfully inspired several neural network architectures, we do not yet know
enough about biological learning for neuroscience to offer much guidance for
the learning algorithms we use to train these architectures.
Media accounts often emphasize the similarity of deep learning to the brain.
While it is true that deep learning researchers are more likely to cite the
brain as an influence than researchers working in other machine learning fields
such as kernel machines or Bayesian statistics, one should not view deep
learning as an attempt to simulate the brain. Modern deep learning draws
inspiration from many fields, especially applied math fundamentals like linear
algebra, probability, information theory, and numerical optimization. While
some deep learning researchers cite neuroscience as an important source of
inspiration, others are not concerned with neuroscience at all.
It is worth noting that the effort to understand how the brain works on an
algorithmic level is alive and well. This endeavor is primarily known as
“computational neuroscience” and is a separate field of study from deep
learning. It is common for researchers to move back and forth between both
fields. The field of deep learning is primarily concerned with how to build
computer systems that are able to successfully solve tasks requiring
intelligence, while the field of computational neuroscience is primarily
concerned with building more accurate models of how the brain actually works.
In the 1980s, the second wave of neural network research emerged in great part
via a movement called connectionism or parallel distributed processing
(Rumelhart et al., 1986c; McClelland et al., 1995). Connectionism arose in the
context of cognitive science. Cognitive science is an interdisciplinary
approach to understanding the mind, combining multiple different levels of
analysis. During the early 1980s, most cognitive scientists studied models of
symbolic reasoning. Despite their popularity, symbolic models were difficult
to explain in terms of how the brain could actually implement them using
neurons. The connectionists began to study models of cognition that could
actually be grounded in neural implementations (Touretzky and Minton, 1985),
reviving many ideas dating back to the work of psychologist Donald Hebb in the
1940s (Hebb, 1949).

The central idea in connectionism is that a large number of simple
computational units can achieve intelligent behavior when networked together.
This insight applies equally to neurons in biological nervous systems and to
hidden units in computational models.

Several key concepts arose during the connectionism movement of the 1980s that
remain central to today’s deep learning.
One of these concepts is that of distributed representation (Hinton et al.,
1986). This is the idea that each input to a system should be represented by
many features, and each feature should be involved in the representation of
many possible inputs. For example, suppose we have a vision system that can
recognize cars, trucks, and birds, and these objects can each be red, green,
or blue. One way of representing these inputs would be to have a separate
neuron or hidden unit that activates for each of the nine possible
combinations: red truck, red car, red bird, green truck, and so on. This
requires nine different neurons, and each neuron must independently learn the
concept of color and object identity. One way to improve on this situation is
to use a distributed representation, with three neurons describing the color
and three neurons describing the object identity. This requires only six
neurons total instead of nine, and the neuron describing redness is able to
learn about redness from images of cars, trucks and birds, not only from
images of one specific category of objects. The concept of distributed
representation is central to this book, and will be described in greater
detail in Chapter 15.

Another major accomplishment of the connectionist movement was the successful
use of back-propagation to train deep neural networks with internal
representations and the popularization of the back-propagation algorithm
(Rumelhart et al., 1986a; LeCun, 1987). This algorithm has waxed and waned in
popularity but as of this writing is currently the dominant approach to
training deep models.
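Returning to the distributed representation example above: the following
sketch (ours; the helper names are hypothetical) encodes a (color, object)
pair both ways, showing that the distributed code grows additively rather than
multiplicatively in the number of units.

    import numpy as np

    colors = ["red", "green", "blue"]
    objects = ["car", "truck", "bird"]

    def one_hot(index, size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    # Non-distributed code: one unit per (color, object) pair, 3 * 3 = 9 units.
    def joint_code(color, obj):
        index = colors.index(color) * len(objects) + objects.index(obj)
        return one_hot(index, len(colors) * len(objects))

    # Distributed code: 3 color units plus 3 object units, only 6 units total.
    def distributed_code(color, obj):
        return np.concatenate([one_hot(colors.index(color), len(colors)),
                               one_hot(objects.index(obj), len(objects))])

    print(joint_code("red", "truck"))        # 9 units, exactly one active
    print(distributed_code("red", "truck"))  # 6 units, one active per factor

In the distributed code, every red input activates the same redness unit, so
whatever that unit learns about redness is shared across cars, trucks, and
birds.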
During the 1990s, researchers made important advances in modeling sequences
with neural networks. Hochreiter (1991) and Bengio et al. (1994) identified
some of the fundamental mathematical difficulties in modeling long sequences,
described in Sec. 10.7. Hochreiter and Schmidhuber (1997) introduced the long
short-term memory or LSTM network to resolve some of these difficulties.
Today, the LSTM is widely used for many sequence modeling tasks, including
many natural language processing tasks at Google.
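The LSTM is described properly in Chapter 10; for readers who want a concrete
picture now, here is one step of a standard LSTM cell in NumPy. This is our
illustrative sketch under common conventions: the parameter names, shapes, and
gate stacking order are ours, and real implementations differ in details.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W: input weights, U: recurrent weights, b: biases, each stacking
        # the forget gate, input gate, cell candidate, and output gate.
        n = h_prev.shape[0]
        z = W @ x + U @ h_prev + b      # all four pre-activations at once
        f = sigmoid(z[0*n:1*n])         # forget gate
        i = sigmoid(z[1*n:2*n])         # input gate
        g = np.tanh(z[2*n:3*n])         # candidate cell update
        o = sigmoid(z[3*n:4*n])         # output gate
        c = f * c_prev + i * g          # cell state carries long-term memory
        h = o * np.tanh(c)              # hidden state / output
        return h, c

    # Tiny usage example with random parameters.
    rng = np.random.default_rng(0)
    n_in, n_hid = 3, 4
    W = rng.standard_normal((4 * n_hid, n_in))
    U = rng.standard_normal((4 * n_hid, n_hid))
    b = np.zeros(4 * n_hid)
    h = np.zeros(n_hid)
    c = np.zeros(n_hid)
    for x in rng.standard_normal((5, n_in)):  # a length-5 input sequence
        h, c = lstm_step(x, h, c, W, U, b)
    print(h)

The gated cell state c is what lets the LSTM retain information over many
steps, addressing the long-sequence difficulties identified above.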
The second wave of neural networks research lasted until the mid-1990s.
Ventures based on neural networks and other AI technologies began to make
unrealistically ambitious claims while seeking investments. When AI research
did not fulfill these unreasonable expectations, investors were disappointed.
Simultaneously, other fields of machine learning made advances. Kernel
machines (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999)
and graphical models (Jordan, 1998) both achieved good results on many
important tasks. These two factors led to a decline in the popularity of
neural networks that lasted until 2007.
During this time, neural networks continued to obtain impressive performance
on some tasks (LeCun et al., 1998b; Bengio et al., 2001). The Canadian
Institute for Advanced Research (CIFAR) helped to keep neural networks
research alive via its Neural Computation and Adaptive Perception (NCAP)
research initiative. This program united machine learning research groups led
by Geoffrey Hinton at University of Toronto, Yoshua Bengio at University of
Montreal, and Yann LeCun at New York University. The CIFAR NCAP research
initiative had a multi-disciplinary nature that also included neuroscientists
and experts in human and computer vision.
At this point in time, deep networks were generally believed to be very
difficult to train. We now know that algorithms that have existed since the
1980s work quite well, but this was not apparent circa 2006. The issue is
perhaps simply that these algorithms were too computationally costly to allow
much experimentation with the hardware available at the time.
The third wave of neural networks research began with a breakthrough in 2006.
Geoffrey Hinton showed that a kind of neural network called a deep belief
network could be efficiently trained using a strategy called greedy layer-wise
pretraining (Hinton et al., 2006), which will be described in more detail in
Sec. 15.1. The other CIFAR-affiliated research groups quickly showed that the
same strategy could be used to train many other kinds of deep networks (Bengio
et al., 2007; Ranzato et al., 2007a) and systematically helped to improve
generalization on test examples. This wave of neural networks research
popularized the use of the term deep learning to emphasize that researchers
were now able to train deeper neural networks than had been possible before,
and to focus attention on the theoretical importance of depth (Bengio and
LeCun, 2007; Delalleau and Bengio, 2011; Pascanu et al., 2014a; Montufar
et al., 2014). At this time, deep neural networks outperformed competing AI
systems based on other machine learning technologies as well as hand-designed
functionality. This third wave of popularity of neural networks continues to
the time of this writing, though the focus of deep learning research has
changed dramatically within the time of this wave. The third wave began with a
focus on new unsupervised learning techniques and the ability of deep models
to generalize well from small datasets, but today there is more interest in
much older supervised learning algorithms and the ability of deep models to
leverage large labeled datasets.
1.2.2 Increasing Dataset Sizes

One may wonder why deep learning has only recently become recognized as a
crucial technology though the first experiments with artificial neural
networks were conducted in the 1950s. Deep learning has been successfully used
in commercial applications since the 1990s, but was often regarded as being
more of an art than a technology and something that only an expert could use,
until recently. It is true that some skill is required to get good performance
from a deep learning algorithm. Fortunately, the amount of skill required
reduces as the amount of training data increases. The learning algorithms
reaching human performance on complex tasks today are nearly identical to the
learning algorithms that struggled to solve toy problems in the 1980s, though
the models we train with these algorithms have undergone changes that simplify
the training of very deep architectures. The most important new development is
that today we can provide these algorithms with the resources they need to
succeed.

Fig. 1.8 shows how the size of benchmark datasets has increased remarkably
over time. This trend is driven by the increasing digitization of society. As
more and more of our activities take place on computers, more and more of what
we do is recorded. As our computers are increasingly networked together, it
becomes easier to centralize these records and curate them into a dataset
appropriate for machine learning applications. The age of “Big Data” has made
machine learning much easier because the key burden of statistical estimation
(generalizing well to new data after observing only a small amount of data)
has been considerably lightened. As of 2016, a rough rule of thumb is that a
supervised deep learning algorithm will generally achieve acceptable
performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10
million labeled examples. Working successfully with datasets smaller than this
is an important research area, focusing in particular on how we can take
advantage of large quantities of unlabeled examples, with unsupervised or
semi-supervised learning.
1.2.3 Increasing Model Sizes

Another key reason that neural networks are wildly successful today after
enjoying comparatively little success since the 1980s is that we have the
computational resources to run much larger models today. One of the main
insights of connectionism is that animals become intelligent when many of
their neurons work together. An individual neuron or small collection of
neurons is not particularly useful.

Biological neurons are not especially densely connected. As seen in Fig. 1.10,
our machine learning models have had a number of connections per neuron that
was within an order of magnitude of even mammalian brains for decades.
In terms of the total number of neurons, neural networks have been
astonishingly small until quite recently, as shown in Fig. 1.11. Since the
introduction of hidden units, artificial neural networks have doubled in size
roughly every 2.4 years. This growth is driven by faster computers with larger
memory and by the availability of larger datasets. Larger networks are able to
achieve higher accuracy on more complex tasks. This trend looks set to
continue for decades. Unless new technologies allow faster scaling, artificial
neural networks will not have the same number of neurons as the human brain
until at least the 2050s. Biological neurons may represent more complicated
functions than current artificial neurons, so biological neural networks may
be even larger than this plot portrays.
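The 2.4-year doubling time makes that projection easy to sanity-check. The
arithmetic below is our back-of-the-envelope sketch: it assumes, roughly in
line with Fig. 1.11, that the largest networks circa 2016 had on the order of
10^7 units, and takes the human brain to have about 10^11 neurons.

    import math

    units_2016 = 1e7       # assumed order of magnitude of large 2016 networks
    brain_neurons = 1e11   # approximate neuron count of the human brain
    doubling_years = 2.4   # observed doubling time of network size

    doublings_needed = math.log2(brain_neurons / units_2016)
    print(2016 + doublings_needed * doubling_years)  # roughly 2048

Starting from a smaller assumed network size pushes the crossing point later,
which is why the text says “at least the 2050s.”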
In retrospect, it is not particularly surprising that neural networks with
fewer neurons than a leech were unable to solve sophisticated artificial
intelligence problems. Even today’s networks, which we consider quite large
from a computational systems point of view, are smaller than the nervous
system of even relatively primitive vertebrate animals like frogs.
[Figure 1.8 appears here: a log-scale plot of dataset size (number of
examples) versus year, from Iris and other early hand-compiled datasets
through MNIST, CIFAR-10, Public SVHN, ImageNet, Sports-1M, and the Canadian
Hansard and WMT translation datasets.]
Figure 1.8: Dataset sizes have increased greatly over time. In the early
1900s, statisticians studied datasets using hundreds or thousands of manually
compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher,
1936). In the 1950s through 1980s, the pioneers of biologically inspired
machine learning often worked with small, synthetic datasets, such as
low-resolution bitmaps of letters, that were designed to incur low
computational cost and to demonstrate that neural networks were able to learn
specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b).
In the 1980s and 1990s, machine learning became more statistical in nature and
began to leverage larger datasets containing tens of thousands of examples,
such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers
(LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated
datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and
Hinton, 2009), continued to be produced. Toward the end of that decade and
throughout the first half of the 2010s, significantly larger datasets,
containing hundreds of thousands to tens of millions of examples, completely
changed what was possible with deep learning. These datasets included the
public Street View House Numbers dataset (Netzer et al., 2011), various
versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky
et al., 2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top
of the graph, we see that datasets of translated sentences, such as IBM’s
dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT
2014 English to French dataset (Schwenk, 2014), are typically far ahead of
other dataset sizes.
Figure 1.9: Example inputs from the MNIST dataset. The “NIST” stands for
National Institute of Standards and Technology, the agency that originally
collected this data. The “M” stands for “modified,” since the data has been
preprocessed for easier use with machine learning algorithms. The MNIST
dataset consists of scans of handwritten digits and associated labels
describing which digit 0-9 is contained in each image. This simple
classification problem is one of the simplest and most widely used tests in
deep learning research. It remains popular despite being quite easy for modern
techniques to solve. Geoffrey Hinton has described it as “the drosophila of
machine learning,” meaning that it allows machine learning researchers to
study their algorithms in controlled laboratory conditions, much as biologists
often study fruit flies.
The increase in model size over time, due to the availability of faster CPUs,
the advent of general purpose GPUs (described in Sec. 12.1.2), faster network
connectivity and better software infrastructure for distributed computing, is
one of the most important trends in the history of deep learning. This trend
is generally expected to continue well into the future.
1.2.4 Increasing Accuracy, Complexity and Real-World Impact

Since the 1980s, deep learning has consistently improved in its ability to
provide accurate recognition or prediction. Moreover, deep learning has
consistently been applied with success to broader and broader sets of
applications.
The earliest deep models were used to recognize individual objects in tightly
cropped, extremely small images (Rumelhart et al., 1986a). Since then there
has been a gradual increase in the size of images neural networks could
process. Modern object recognition networks process rich high-resolution
photographs and do not have a requirement that the photo be cropped near the
object to be recognized (Krizhevsky et al., 2012). Similarly, the earliest
networks could only recognize two kinds of objects (or in some cases, the
absence or presence of a single kind of object), while these modern networks
typically recognize at least 1,000 different categories of objects. The
largest contest in object recognition is the ImageNet Large-Scale Visual
Recognition Challenge (ILSVRC) held each year. A dramatic moment in the
meteoric rise of deep learning came when a convolutional network won this
challenge for the first time and by a wide margin, bringing down the
state-of-the-art top-5 error rate from 26.1% to 15.3% (Krizhevsky et al.,
2012), meaning that the convolutional network produces a ranked list of
possible categories for each image and the correct category appeared in the
first five entries of this list for all but 15.3% of the test examples. Since
then, these competitions are consistently won by deep convolutional nets, and
as of this writing, advances in deep learning have brought the latest top-5
error rate in this contest down to 3.6%, as shown in Fig. 1.12.
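The top-5 error rate is simple to compute from a model’s scores. The function
below is our illustration (the data is random and made up): an example counts
as an error only if the true label is missing from the five highest-scoring
categories.

    import numpy as np

    def top5_error_rate(scores, labels):
        # scores: (n_examples, n_classes) model scores;
        # labels: (n_examples,) true class indices.
        top5 = np.argsort(scores, axis=1)[:, -5:]      # five best classes
        hit = np.any(top5 == labels[:, None], axis=1)  # true label among them?
        return 1.0 - hit.mean()

    rng = np.random.default_rng(0)
    scores = rng.standard_normal((100, 1000))  # random scores, 1,000 classes
    labels = rng.integers(0, 1000, size=100)
    print(top5_error_rate(scores, labels))     # near 0.995 for random guessing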
Deep learning has also had a dramatic impact on speech recognition. After
improving throughout the 1990s, the error rates for speech recognition
stagnated starting in about 2000. The introduction of deep learning (Dahl
et al., 2010; Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to
speech recognition resulted in a sudden drop of error rates, with some error
rates cut in half. We will explore this history in more detail in Sec. 12.3.
Deep networks have also had spectacular successes for pedestrian detection and
image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie
et al., 2013) and yielded superhuman performance in traffic sign
classification (Ciresan et al., 2012).
[Figure 1.10 appears here: a log-scale plot of the number of connections per
neuron in artificial networks versus year, with reference levels for the fruit
fly, mouse, cat, and human; numbered points correspond to the networks listed
below.]
Figure 1.10: Initially, the number of connections between neurons in
artificial neural networks was limited by hardware capabilities. Today, the
number of connections between neurons is mostly a design consideration. Some
artificial neural networks have nearly as many connections per neuron as a
cat, and it is quite common for other neural networks to have as many
connections per neuron as smaller mammals like mice. Even the human brain does
not have an exorbitant amount of connections per neuron. Biological neural
network sizes from Wikipedia (2015).
1. Adaptive linear element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014a)

At the same time that the scale and accuracy of deep networks has increased,
so has the complexity of the tasks that they can solve. Goodfellow et al.
(2014d) showed that neural networks could learn to output an entire sequence
of characters transcribed from an image, rather than just identifying a single
object. Previously, it was widely believed that this kind of learning required
labeling of the individual elements of the sequence (Gülçehre and Bengio,
2013). Recurrent neural networks, such as the LSTM sequence model mentioned
above, are now used to model relationships between sequences and other
sequences rather than just fixed inputs. This sequence-to-sequence learning
seems to be on the cusp of revolutionizing another application: machine
translation (Sutskever et al., 2014; Bahdanau et al., 2015).
This trend of increasing complexity has been pushed to its logical conclusion
with the introduction of neural Turing machines (Graves et al., 2014a) that
learn to read from memory cells and write arbitrary content to memory cells.
Such neural networks can learn simple programs from examples of desired
behavior. For example, they can learn to sort lists of numbers given examples
of scrambled and sorted sequences. This self-programming technology is in its
infancy, but in the future could in principle be applied to nearly any task.
Another crowning achievement of deep learning is its extension to the domain
of reinforcement learning. In the context of reinforcement learning, an
autonomous agent must learn to perform a task by trial and error, without any
guidance from the human operator. DeepMind demonstrated that a reinforcement
learning system based on deep learning is capable of learning to play Atari
video games, reaching human-level performance on many tasks (Mnih et al.,
2015). Deep learning has also significantly improved the performance of
reinforcement learning for robotics (Finn et al., 2015).

Many of these applications of deep learning are highly profitable. Deep
learning is now used by many top technology companies including Google,
Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.
Advances in deep learning have also depended heavily on advances in software
infrastructure. Software libraries such as Theano (Bergstra et al., 2010;
Bastien et al., 2012), PyLearn2 (Goodfellow et al., 2013c), Torch (Collobert
et al., 2011b), DistBelief (Dean et al., 2012), Caffe (Jia, 2013), MXNet (Chen
et al., 2015), and TensorFlow (Abadi et al., 2015) have all supported
important research projects or commercial products.
Deep learning has also made contributions back to other sciences. Modern
convolutional networks for object recognition provide a model of visual
processing that neuroscientists can study (DiCarlo, 2013).
Deep learning also provides useful tools for processing massive amounts of
data and making useful predictions in scientific fields. It has been
successfully used to predict how molecules will interact in order to help
pharmaceutical companies design new drugs (Dahl et al., 2014), to search for
subatomic particles (Baldi et al., 2014), and to automatically parse
microscope images used to construct a 3-D map of the human brain
(Knowles-Barley et al., 2014). We expect deep learning to appear in more and
more scientific fields in the future.
In summary, deep learning is an approach to machine learning that has drawn
heavily on our knowledge of the human brain, statistics and applied math as it
developed over the past several decades. In recent years, it has seen
tremendous growth in its popularity and usefulness, due in large part to more
powerful computers, larger datasets and techniques to train deeper networks.
The years ahead are full of challenges and opportunities to improve deep
learning even further and bring it to new frontiers.
[Figure 1.11 appears here: a log-scale plot of the total number of neurons in
artificial networks versus year, extrapolated to the 2050s, with reference
levels for organisms from sponge and roundworm through leech, ant, bee, frog,
octopus, and human; numbered points correspond to the networks listed below.]
Figure 1.11: Since the introduction of hidden units, artificial neural
networks have doubled in size roughly every 2.4 years. Biological neural
network sizes from Wikipedia (2015).
1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive linear element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014a)

[Figure 1.12 appears here: a plot of the ILSVRC classification error rate by
year from 2010 to 2015, falling from roughly 0.28 to below 0.05.]
Figure 1.12: Since deep networks reached the scale necessary to compete in the
ImageNet Large Scale Visual Recognition Challenge, they have consistently won
the competition every year, and yielded lower and lower error rates each time.
Data from Russakovsky et al. (2014b) and He et al. (2015).
Part I

Applied Math and Machine Learning Basics
This part of the book introduces the basic mathematical concepts needed to
understand deep learning. We begin with general ideas from applied math that
allow us to define functions of many variables, find the highest and lowest
points on these functions and quantify degrees of belief.

Next, we describe the fundamental goals of machine learning. We describe how
to accomplish these goals by specifying a model that represents certain
beliefs, designing a cost function that measures how well those beliefs
correspond with reality and using a training algorithm to minimize that cost
function.

This elementary framework is the basis for a broad variety of machine learning
algorithms, including approaches to machine learning that are not deep. In the
subsequent parts of the book, we develop deep learning algorithms within this
framework.
Chapter 2

Linear Algebra
Linear algebra is a branch of mathematics that is widely used throughout science
and engineering. However, because linear algebra is a form of continuous rather
than discrete mathematics, many computer scientists have little experience with it.
A good understanding of linear algebra is essential for understanding and working
with many machine learning algorithms, especially deep learning algorithms. We
therefore precede our introduction to deep learning with a focused presentation of
the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If
you have previous experience with these concepts but need a detailed reference
sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and
Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter
will teach you enough to read this book, but we highly recommend that you also
consult another resource focused exclusively on teaching linear algebra, such as
Shilov (1977). This chapter will completely omit many important linear algebra
topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:
• Scalars: A scalar is just a single number, in contrast to most of the other
objects studied in linear algebra, which are usually arrays of multiple numbers.
We write scalars in italics. We usually give scalars lower-case variable names.
When we introduce them, we specify what kind of number they are. For
example, we might say "Let s ∈ ℝ be the slope of the line," while defining a
real-valued scalar, or "Let n ∈ ℕ be the number of units," while defining a
natural number scalar.

• Vectors: A vector is an array of numbers. The numbers are arranged in
order. We can identify each individual number by its index in that ordering.
Typically we give vectors lower case names written in bold typeface, such
as x. The elements of the vector are identified by writing its name in italic
typeface, with a subscript. The first element of x is x_1, the second element
is x_2 and so on. We also need to say what kind of numbers are stored in
the vector. If each element is in ℝ, and the vector has n elements, then the
vector lies in the set formed by taking the Cartesian product of ℝ n times,
denoted as ℝ^n. When we need to explicitly identify the elements of a vector,
we write them as a column enclosed in square brackets:

    x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.    (2.1)

We can think of vectors as identifying points in space, with each element
giving the coordinate along a different axis.

Sometimes we need to index a set of elements of a vector. In this case, we
define a set containing the indices and write the set as a subscript. For
example, to access x_1, x_3 and x_6, we define the set S = {1, 3, 6} and write
x_S. We use the − sign to index the complement of a set. For example x_{−1} is
the vector containing all elements of x except for x_1, and x_{−S} is the vector
containing all of the elements of x except for x_1, x_3 and x_6.

• Matrices: A matrix is a 2-D array of numbers, so each element is identified by
two indices instead of just one. We usually give matrices upper-case variable
names with bold typeface, such as A. If a real-valued matrix A has a height
of m and a width of n, then we say that A ∈ ℝ^{m×n}. We usually identify
the elements of a matrix using its name in italic but not bold font, and the
indices are listed with separating commas. For example, A_{1,1} is the upper
left entry of A and A_{m,n} is the bottom right entry. We can identify all of
the numbers with vertical coordinate i by writing a ":" for the horizontal
coordinate. For example, A_{i,:} denotes the horizontal cross section of A with
vertical coordinate i. This is known as the i-th row of A. Likewise, A_{:,i} is
the i-th column of A. When we need to explicitly identify the elements of a
matrix, we write them as an array enclosed in square brackets:

    \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}.    (2.2)

Sometimes we may need to index matrix-valued expressions that are not just
a single letter. In this case, we use subscripts after the expression, but do
not convert anything to lower case. For example, f(A)_{i,j} gives element (i, j)
of the matrix computed by applying the function f to A.

• Tensors: In some cases we will need an array with more than two axes. In
the general case, an array of numbers arranged on a regular grid with a
variable number of axes is known as a tensor. We denote a tensor named "A"
with a sans-serif bold typeface: 𝗔. We identify the element of 𝗔 at coordinates
(i, j, k) by writing A_{i,j,k}.
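To make these conventions concrete, here is a minimal NumPy sketch (our illustration, not part of the original text; all values are arbitrary) constructing each kind of object and using the indexing notation described above.

    import numpy as np

    s = 3.5                                 # scalar: a single number
    x = np.array([1.0, 2.0, 3.0, 4.0])      # vector: a 1-D array
    A = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])              # matrix: a 2-D array, here 3 x 2
    T = np.zeros((2, 3, 4))                 # tensor: an array with more than two axes

    print(x[0])             # x_1 in the book's 1-based notation (NumPy is 0-based)
    S = [0, 2]              # the index set S = {1, 3} in 1-based notation
    print(x[S])             # x_S: the elements of x indexed by S
    print(np.delete(x, 0))  # x_{-1}: all elements of x except x_1
    print(A[0, :])          # A_{1,:}: the first row of A
    print(A[:, 0])          # A_{:,1}: the first column of A
    print(T[0, 1, 2])       # a single tensor element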
    A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix}
    ⇒ A^⊤ = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix}

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the
main diagonal.

One important operation on matrices is the transpose. The transpose of a
matrix is the mirror image of the matrix across a diagonal line, called the main
diagonal, running down and to the right, starting from its upper left corner. See
Fig. 2.1 for a graphical depiction of this operation. We denote the transpose of a
matrix A as A^⊤, and it is defined such that

    (A^⊤)_{i,j} = A_{j,i}.    (2.3)

Vectors can be thought of as matrices that contain only one column. The
transpose of a vector is therefore a matrix with only one row. Sometimes we
define a vector by writing out its elements in the text inline as a row matrix,
then using the transpose operator to turn it into a standard column vector, e.g.,
x = [x_1, x_2, x_3]^⊤.

A scalar can be thought of as a matrix with only a single entry. From this, we
can see that a scalar is its own transpose: a = a^⊤.

We can add matrices to each other, as long as they have the same shape, just
by adding their corresponding elements: C = A + B where C_{i,j} = A_{i,j} + B_{i,j}.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just
by performing that operation on each element of a matrix: D = a · B + c where
D_{i,j} = a · B_{i,j} + c.

In the context of deep learning, we also use some less conventional notation.
We allow the addition of a matrix and a vector, yielding another matrix: C = A + b,
where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the
matrix. This shorthand eliminates the need to define a matrix with b copied into
each row before doing the addition. This implicit copying of b to many locations
is called broadcasting.
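NumPy implements exactly this broadcasting rule, which makes for a convenient check of the notation above (a sketch of ours; the values are arbitrary).

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
    b = np.array([10.0, 20.0])

    print(A.T)            # transpose: (A^T)_{i,j} = A_{j,i}, shape (2, 3)
    print(2.0 * A + 1.0)  # scalar multiply and add: D_{i,j} = a * A_{i,j} + c
    print(A + b)          # broadcasting: b is added to each row of A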
2.2 Multiplying Matrices and Vectors

One of the most important operations involving matrices is multiplication of two
matrices. The matrix product of matrices A and B is a third matrix C. In order
for this product to be defined, A must have the same number of columns as B has
rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p.
We can write the matrix product just by placing two or more matrices together,
e.g.

    C = AB.    (2.4)

The product operation is defined by

    C_{i,j} = \sum_k A_{i,k} B_{k,j}.    (2.5)

Note that the standard product of two matrices is not just a matrix containing
the product of the individual elements. Such an operation exists and is called the
element-wise product or Hadamard product, and is denoted as A ⊙ B.

The dot product between two vectors x and y of the same dimensionality is the
matrix product x^⊤ y. We can think of the matrix product C = AB as computing
C_{i,j} as the dot product between row i of A and column j of B.
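The distinction between these three products is easy to see numerically; the following NumPy sketch (our example, with arbitrary values) checks Eq. 2.5 against an explicit sum.

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])     # shape (2, 3)
    B = np.arange(12.0).reshape(3, 4)   # shape (3, 4)

    C = A @ B                           # matrix product, shape (2, 4)
    # Eq. 2.5 written out: C_{i,j} = sum_k A_{i,k} B_{k,j}
    C_manual = np.array([[sum(A[i, k] * B[k, j] for k in range(3))
                          for j in range(4)] for i in range(2)])
    assert np.allclose(C, C_manual)

    H = A * A                           # Hadamard (element-wise) product

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 5.0, 6.0])
    print(x @ y)                        # dot product x^T y = 32.0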

Matrix product operations have many useful properties that make mathematical
analysis of matrices more convenient. For example, matrix multiplication is
distributive:

    A(B + C) = AB + AC.    (2.6)

It is also associative:

    A(BC) = (AB)C.    (2.7)

Matrix multiplication is not commutative (the condition AB = BA does not
always hold), unlike scalar multiplication. However, the dot product between two
vectors is commutative:

    x^⊤ y = y^⊤ x.    (2.8)

The transpose of a matrix product has a simple form:

    (AB)^⊤ = B^⊤ A^⊤.    (2.9)

This allows us to demonstrate Eq. 2.8, by exploiting the fact that the value of
such a product is a scalar and therefore equal to its own transpose:

    x^⊤ y = (x^⊤ y)^⊤ = y^⊤ x.    (2.10)
Since the focus of this textbook is not linear algebra, we do not attempt to
develop a comprehensive list of useful properties of the matrix product here, but
the reader should be aware that many more exist.

We now know enough linear algebra notation to write down a system of linear
equations:

    Ax = b    (2.11)

where A ∈ ℝ^{m×n} is a known matrix, b ∈ ℝ^m is a known vector, and x ∈ ℝ^n is a
vector of unknown variables we would like to solve for. Each element x_i of x is one
of these unknown variables. Each row of A and each element of b provide another
constraint. We can rewrite Eq. 2.11 as:

    A_{1,:} x = b_1    (2.12)
    A_{2,:} x = b_2    (2.13)
    ...    (2.14)
    A_{m,:} x = b_m    (2.15)

or, even more explicitly, as:

    A_{1,1} x_1 + A_{1,2} x_2 + ⋯ + A_{1,n} x_n = b_1    (2.16)
    A_{2,1} x_1 + A_{2,2} x_2 + ⋯ + A_{2,n} x_n = b_2    (2.17)
    ...    (2.18)
    A_{m,1} x_1 + A_{m,2} x_2 + ⋯ + A_{m,n} x_n = b_m.    (2.19)

Matrix-vector product notation provides a more compact representation for
equations of this form.

    I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

Figure 2.2: Example identity matrix: This is I_3.
2.3 Identity and Inverse Matrices

Linear algebra offers a powerful tool called matrix inversion that allows us to
analytically solve Eq. 2.11 for many values of A.

To describe matrix inversion, we first need to define the concept of an identity
matrix. An identity matrix is a matrix that does not change any vector when we
multiply that vector by that matrix. We denote the identity matrix that preserves
n-dimensional vectors as I_n. Formally, I_n ∈ ℝ^{n×n}, and

    ∀x ∈ ℝ^n, I_n x = x.    (2.20)

The structure of the identity matrix is simple: all of the entries along the main
diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.

The matrix inverse of A is denoted as A^{−1}, and it is defined as the matrix
such that

    A^{−1} A = I_n.    (2.21)

We can now solve Eq. 2.11 by the following steps:

    Ax = b    (2.22)
    A^{−1} Ax = A^{−1} b    (2.23)
    I_n x = A^{−1} b    (2.24)
    x = A^{−1} b.    (2.25)

Of course, this depends on it being possible to find A^{−1}. We discuss the
conditions for the existence of A^{−1} in the following section.

When A^{−1} exists, several different algorithms exist for finding it in closed form.
In theory, the same inverse matrix can then be used to solve the equation many
times for different values of b. However, A^{−1} is primarily useful as a theoretical
tool, and should not actually be used in practice for most software applications.
Because A^{−1} can be represented with only limited precision on a digital computer,
algorithms that make use of the value of b can usually obtain more accurate
estimates of x.
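A minimal NumPy sketch of this point (our example values, not the book's): np.linalg.solve works from A and b directly, and is generally preferred over forming A^{−1} explicitly.

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([9.0, 8.0])

    x_inv = np.linalg.inv(A) @ b     # the textbook derivation: x = A^{-1} b
    x_solve = np.linalg.solve(A, b)  # preferred in practice: uses A and b directly

    assert np.allclose(x_inv, x_solve)
    assert np.allclose(A @ x_solve, b)  # the solution satisfies Ax = b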
2.4 Linear Dependence and Span

In order for A^{−1} to exist, Eq. 2.11 must have exactly one solution for every value
of b. However, it is also possible for the system of equations to have no solutions
or infinitely many solutions for some values of b. It is not possible to have more
than one but less than infinitely many solutions for a particular b; if both x and y
are solutions then

    z = αx + (1 − α)y    (2.26)

is also a solution for any real α.

To analyze how many solutions the equation has, we can think of the columns
of A as specifying different directions we can travel from the origin (the point
specified by the vector of all zeros), and determine how many ways there are of
reaching b. In this view, each element of x specifies how far we should travel in
each of these directions, with x_i specifying how far to move in the direction of
column i:

    Ax = \sum_i x_i A_{:,i}.    (2.27)

In general, this kind of operation is called a linear combination. Formally, a linear
combination of some set of vectors {v^{(1)}, …, v^{(n)}} is given by multiplying each
vector v^{(i)} by a corresponding scalar coefficient and adding the results:

    \sum_i c_i v^{(i)}.    (2.28)

The span of a set of vectors is the set of all points obtainable by linear combination
of the original vectors.
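Eq. 2.27 can be checked directly in NumPy (a sketch of ours with arbitrary values): a matrix-vector product is a linear combination of the matrix's columns.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
    x = np.array([10.0, -1.0])

    # Ax as a linear combination of the columns of A, weighted by the entries of x
    combo = x[0] * A[:, 0] + x[1] * A[:, 1]
    assert np.allclose(A @ x, combo)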

Determining whether Ax = b has a solution thus amounts to testing whether
b is in the span of the columns of A. This particular span is known as the column
space or the range of A.

In order for the system Ax = b to have a solution for all values of b ∈ ℝ^m,
we therefore require that the column space of A be all of ℝ^m. If any point in ℝ^m
is excluded from the column space, that point is a potential value of b that has
no solution. The requirement that the column space of A be all of ℝ^m implies
immediately that A must have at least m columns, i.e., n ≥ m. Otherwise, the
dimensionality of the column space would be less than m. For example, consider a
3 × 2 matrix. The target b is 3-D, but x is only 2-D, so modifying the value of x
at best allows us to trace out a 2-D plane within ℝ^3. The equation has a solution
if and only if b lies on that plane.

Having n ≥ m is only a necessary condition for every point to have a solution.
It is not a sufficient condition, because it is possible for some of the columns to be
redundant. Consider a 2 × 2 matrix where both of the columns are equal to each
other. This has the same column space as a 2 × 1 matrix containing only one copy
of the replicated column. In other words, the column space is still just a line, and
fails to encompass all of ℝ^2, even though there are two columns.

Formally, this kind of redundancy is known as linear dependence. A set of
vectors is linearly independent if no vector in the set is a linear combination of the
other vectors. If we add a vector to a set that is a linear combination of the other
vectors in the set, the new vector does not add any points to the set's span. This
means that for the column space of the matrix to encompass all of ℝ^m, the matrix
must contain at least one set of m linearly independent columns. This condition
is both necessary and sufficient for Eq. 2.11 to have a solution for every value of
b. Note that the requirement is for a set to have exactly m linearly independent
columns, not at least m. No set of m-dimensional vectors can have more than m
mutually linearly independent columns, but a matrix with more than m columns
may have more than one such set.

In order for the matrix to have an inverse, we additionally need to ensure that
Eq. 2.11 has at most one solution for each value of b. To do so, we need to ensure
that the matrix has at most m columns. Otherwise there is more than one way of
parametrizing each solution.

Together, this means that the matrix must be square, that is, we require that
m = n and that all of the columns must be linearly independent. A square matrix
with linearly dependent columns is known as singular.

If A is not square or is square but singular, it can still be possible to solve the
equation. However, we can not use the method of matrix inversion to find the
solution.

So far we have discussed matrix inverses as being multiplied on the left. It is
also possible to define an inverse that is multiplied on the right:

    A A^{−1} = I.    (2.29)

For square matrices, the left inverse and right inverse are equal.
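Numerically, linear dependence and singularity can be detected with the matrix rank (a NumPy sketch of ours, not from the book): a square matrix is singular exactly when its rank falls below its number of columns.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0]])  # second column is twice the first: linearly dependent

    print(np.linalg.matrix_rank(A))  # 1 < 2, so A is singular
    # np.linalg.inv(A) would raise numpy.linalg.LinAlgError here

    B = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    print(np.linalg.matrix_rank(B))  # 2: the columns are independent, B is invertible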
2.5 Norms

Sometimes we need to measure the size of a vector. In machine learning, we usually
measure the size of vectors using a function called a norm. Formally, the L^p norm
is given by

    ||x||_p = \left( \sum_i |x_i|^p \right)^{1/p}    (2.30)

for p ∈ ℝ, p ≥ 1.

Norms, including the L^p norm, are functions mapping vectors to non-negative
values. On an intuitive level, the norm of a vector x measures the distance from
the origin to the point x. More rigorously, a norm is any function f that satisfies
the following properties:

• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y) (the triangle inequality)
• ∀α ∈ ℝ, f(αx) = |α| f(x)

The L^2 norm, with p = 2, is known as the Euclidean norm. It is simply the
Euclidean distance from the origin to the point identified by x. The L^2 norm is
used so frequently in machine learning that it is often denoted simply as ||x||, with
the subscript 2 omitted. It is also common to measure the size of a vector using
the squared L^2 norm, which can be calculated simply as x^⊤ x.

The squared L^2 norm is more convenient to work with mathematically and
computationally than the L^2 norm itself. For example, the derivatives of the
squared L^2 norm with respect to each element of x each depend only on the
corresponding element of x, while all of the derivatives of the L^2 norm depend
on the entire vector. In many contexts, the squared L^2 norm may be undesirable
because it increases very slowly near the origin. In several machine learning
applications, it is important to discriminate between elements that are exactly
zero and elements that are small but nonzero. In these cases, we turn to a function
that grows at the same rate in all locations, but retains mathematical simplicity:
the L^1 norm. The L^1 norm may be simplified to

    ||x||_1 = \sum_i |x_i|.    (2.31)

The L^1 norm is commonly used in machine learning when the difference between
zero and nonzero elements is very important. Every time an element of x moves
away from 0 by ε, the L^1 norm increases by ε.

We sometimes measure the size of the vector by counting its number of nonzero
elements. Some authors refer to this function as the "L^0 norm," but this is incorrect
terminology. The number of non-zero entries in a vector is not a norm, because
scaling the vector by α does not change the number of nonzero entries. The L^1
norm is often used as a substitute for the number of nonzero entries.

One other norm that commonly arises in machine learning is the L^∞ norm,
also known as the max norm. This norm simplifies to the absolute value of the
element with the largest magnitude in the vector,

    ||x||_∞ = \max_i |x_i|.    (2.32)

Sometimes we may also wish to measure the size of a matrix. In the context
of deep learning, the most common way to do this is with the otherwise obscure
Frobenius norm

    ||A||_F = \sqrt{ \sum_{i,j} A_{i,j}^2 },    (2.33)

which is analogous to the L^2 norm of a vector.

The dot product of two vectors can be rewritten in terms of norms. Specifically,

    x^⊤ y = ||x||_2 ||y||_2 \cos θ,    (2.34)

where θ is the angle between x and y.
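All of these norms are available through np.linalg.norm; the sketch below (our example) evaluates the L^1, L^2 and max norms of a vector and the Frobenius norm of a matrix.

    import numpy as np

    x = np.array([3.0, -4.0])

    print(np.linalg.norm(x, ord=1))       # L^1 norm: |3| + |-4| = 7
    print(np.linalg.norm(x))              # L^2 norm (the default): sqrt(9 + 16) = 5
    print(np.linalg.norm(x, ord=np.inf))  # max norm: max(|3|, |-4|) = 4
    print(x @ x)                          # squared L^2 norm computed as x^T x = 25

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    print(np.linalg.norm(A, ord='fro'))   # Frobenius norm: sqrt(1 + 4 + 9 + 16)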
2.6 Special Kinds of Matrices and Vectors

Some special kinds of matrices and vectors are particularly useful.

Diagonal matrices consist mostly of zeros and have non-zero entries only along
the main diagonal. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for
all i ≠ j. We have already seen one example of a diagonal matrix: the identity
matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square
diagonal matrix whose diagonal entries are given by the entries of the vector v.
Diagonal matrices are of interest in part because multiplying by a diagonal matrix
is very computationally efficient. To compute diag(v) x, we only need to scale each
element x_i by v_i. In other words, diag(v) x = v ⊙ x. Inverting a square diagonal
matrix is also efficient. The inverse exists only if every diagonal entry is nonzero,
and in that case, diag(v)^{−1} = diag([1/v_1, …, 1/v_n]^⊤). In many cases, we may
derive some very general machine learning algorithm in terms of arbitrary matrices,
but obtain a less expensive (and less descriptive) algorithm by restricting some
matrices to be diagonal.

Not all diagonal matrices need be square. It is possible to construct a rectangular
diagonal matrix. Non-square diagonal matrices do not have inverses but it is still
possible to multiply by them cheaply. For a non-square diagonal matrix D, the
product Dx will involve scaling each element of x, and either concatenating some
zeros to the result if D is taller than it is wide, or discarding some of the last
elements of the vector if D is wider than it is tall.

A symmetric matrix is any matrix that is equal to its own transpose:

    A = A^⊤.    (2.35)

Symmetric matrices often arise when the entries are generated by some function of
two arguments that does not depend on the order of the arguments. For example,
if A is a matrix of distance measurements, with A_{i,j} giving the distance from point
i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.

A unit vector is a vector with unit norm:

    ||x||_2 = 1.    (2.36)

A vector x and a vector y are orthogonal to each other if x^⊤ y = 0. If both
vectors have nonzero norm, this means that they are at a 90 degree angle to each
other. In ℝ^n, at most n vectors may be mutually orthogonal with nonzero norm.
If the vectors are not only orthogonal but also have unit norm, we call them
orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal
and whose columns are mutually orthonormal:

    A^⊤ A = A A^⊤ = I.    (2.37)

This implies that

    A^{−1} = A^⊤,    (2.38)

so orthogonal matrices are of interest because their inverse is very cheap to compute.
Pay careful attention to the definition of orthogonal matrices. Counterintuitively,
their rows are not merely orthogonal but fully orthonormal. There is no special
term for a matrix whose rows or columns are orthogonal but not orthonormal.
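Two of these identities in NumPy form (a sketch of ours; the rotation matrix is just a convenient example of an orthogonal matrix):

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])
    x = np.array([4.0, 5.0, 6.0])
    assert np.allclose(np.diag(v) @ x, v * x)  # diag(v) x = v ⊙ x

    theta = 0.3                                # a 2-D rotation matrix is orthogonal
    Q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    assert np.allclose(Q.T @ Q, np.eye(2))     # Q^T Q = I
    assert np.allclose(np.linalg.inv(Q), Q.T)  # Q^{-1} = Q^T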
2.7 Eigendecomposition

Many mathematical objects can be understood better by breaking them into
constituent parts, or finding some properties of them that are universal, not caused
by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we
represent the number 12 will change depending on whether we write it in base ten
or in binary, but it will always be true that 12 = 2 × 2 × 3. From this representation
we can conclude useful properties, such as that 12 is not divisible by 5, or that any
integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by
decomposing it into prime factors, we can also decompose matrices in ways that
show us information about their functional properties that is not obvious from the
representation of the matrix as an array of elements.

One of the most widely used kinds of matrix decomposition is called eigen-
decomposition, in which we decompose a matrix into a set of eigenvectors and
eigenvalues.

An eigenvector of a square matrix A is a non-zero vector v such that multipli-
cation by A alters only the scale of v:

    Av = λv.    (2.39)

The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One
can also find a left eigenvector such that v^⊤ A = λ v^⊤, but we are usually concerned
with right eigenvectors).

If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ ℝ, s ≠ 0.
Moreover, sv still has the same eigenvalue. For this reason, we usually only look
for unit eigenvectors.

Suppose that a matrix A has n linearly independent eigenvectors {v^{(1)}, …,
v^{(n)}}, with corresponding eigenvalues {λ_1, …, λ_n}. We may concatenate all of the
eigenvectors to form a matrix V with one eigenvector per column: V = [v^{(1)}, …,
v^{(n)}]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ_1, …,
λ_n]^⊤. The eigendecomposition of A is then given by

    A = V diag(λ) V^{−1}.    (2.40)

[Figure 2.3 plot omitted: two panels, "Before multiplication" and "After multiplication", illustrating the effect of eigenvectors and eigenvalues.]

Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have
a matrix A with two orthonormal eigenvectors, v^{(1)} with eigenvalue λ_1 and v^{(2)} with
eigenvalue λ_2. (Left) We plot the set of all unit vectors u ∈ ℝ^2 as a unit circle. (Right)
We plot the set of all points Au. By observing the way that A distorts the unit circle, we
can see that it scales space in direction v^{(i)} by λ_i.
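Eq. 2.40 can be verified directly with np.linalg.eig (a sketch of ours with an arbitrary matrix):

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])

    lam, V = np.linalg.eig(A)  # eigenvalues lam; eigenvectors are the columns of V
    assert np.allclose(A @ V[:, 0], lam[0] * V[:, 0])           # Av = lambda v (Eq. 2.39)
    assert np.allclose(A, V @ np.diag(lam) @ np.linalg.inv(V))  # A = V diag(lam) V^{-1}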

We hav
havee seen that constructing V diag(λwith
A = matrices )V sp . ecific eigenv
specific eigenvalues (2.40)
alues and eigenv
eigenvec-
ec-
tors allo
allows
ws us to stretch space in desired directions. Ho Howev
wev
wever,
er, we often wan antt to
de W
deccompe hav
ompose e seen that constructing
ose matrices into their eigen
eigenvmatrices with sp
values and eigenv ecific
ectors. Doing so can help ec-
eigenvectors.eigenv alues and eigenv us
torsanalyze
to allows certain
us to stretch
prop space
properties
erties of in
thedesired
matrix,directions.
muc
much However,
h as decomp
decomposing wean
osing often waninto
integer t to
decomp
its primeosefactors
matrices
caninto
helptheir eigenvaluesthe
us understand andbehavior
eigenvectors.
of thatDoing
in so can help us
integer.
teger.
to analyze certain prop erties of the matrix, much as decomposing an integer into
Not every matrix can b e decomp
decomposedosed in
into
to eigenv
eigenvalues
alues and eigenv
eigenvectors.
ectors. In some
its prime factors can help us understand the behavior of that integer.
Not every matrix can b e decomposed43into eigenvalues and eigenvectors. In some
CHAPTER 2. LINEAR ALGEBRA

cases, the decomp


decomposition
osition exists, but ma may y in
inv
volv
olvee complex rather than real numbers.
Fortunately
ortunately,, in this b ook, we usually need to decomp decompose ose only a sp specific
ecific class of
cases, the
matrices that hadecomp osition
have exists,
ve a simple decomp but ma y
decomposition. inv olv
osition. Sp e complex
Specifically
ecifically rather
ecifically,, ev ery realreal
everythan numbers.
symmetric
Fortunately
matrix can b, eindecomposed
this b ook, we intousually need to using
an expression decomp osereal-v
only only alued
a specific
real-valued eigen class
eigenvectors
vectors of
matrices
and eigen
eigenv that
values: ha ve a simple decomp osition. Sp ecifically , ev ery real symmetric
matrix can b e decomposed into anAexpression = QΛQ>,using only real-valued eigenvectors (2.41)
and eigenvalues:
where Q is an orthogonal matrixAcomp composed
= QΛ osed
Q ,of eigenv ectors of A, and Λ(2.41)
eigenvectors is a
diagonal matrix. The eigen eigenv value Λi,i is asso associated
ciated with the eigen eigenv vector in column i
where
of Q is anasorthogonal
Q, denoted Q:,i. Because matrixQ iscomp osed of eigenv
an orthogonal ectors
matrix, we ofcanAthink
, and of ΛAis as a
diagonalspace
scaling matrix.by λThe eigenvalue Λ(i) is associated with the eigenvector in column i
i in direction v . See Fig. 2.3 for an example.
of Q, denoted as Q . Because Q is an orthogonal matrix, we can think of A as
While an
any ybreal A guaran
scaling space y λ symmetric
in directionmatrixv . SeeisFig. guaranteed
2.3 teed
for antoexample.
ha
havve an eigendecomp
eigendecomposi- osi-
tion, the eigendecomp
eigendecompositionosition ma may y not be unique. If any tw two o or more eigenveigenvectors
ectors
share While any real
the same symmetric
eigenv
eigenvalue,
alue, then matrix
an
any A isofguaran
y set teed to
orthogonal have an
vectors eigendecomp
lying in their span osi-
tion,also
are theeigenv
eigendecomp
eigenvectors osition
ectors with thatmaeigenv
y notalue,
eigenvalue,be unique.
and weIfcould
any tw o oralently
equiv more eigenv
equivalently chooseectors
aQ
share the same
using those eigenv eigenv
eigenvectors alue, then an
ectors instead. By con y set
conv of
venorthogonal
ention, vectors lying
tion, we usually sort the en in their
entries span
tries of Λ
are also eigenv ectors with
in descending order. Under this conv that eigenv alue,
convention, and w e could
ention, the eigendecomp equiv
eigendecomposition alently choose
osition is unique only aQ
using
if all ofthose eigenvalues
the eigenv ectorsare
eigenvalues unique.By convention, we usually sort the entries of Λ
instead.
in descending order. Under this convention, the eigendecomposition is unique only
The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are 0. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form $f(x) = x^\top A x$ subject to $\|x\|_2 = 1$. Whenever $x$ is equal to an eigenvector of $A$, $f$ takes on the value of the corresponding eigenvalue. The maximum value of $f$ within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.

A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that $\forall x,\; x^\top A x \ge 0$. Positive definite matrices additionally guarantee that $x^\top A x = 0 \Rightarrow x = 0$.
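To make these facts concrete, here is a minimal NumPy sketch of our own (not part of the original text) that checks them numerically; `np.linalg.eigh` is the standard routine for the eigendecomposition of a symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # any real symmetric matrix

w, Q = np.linalg.eigh(A)               # eigenvalues ascending, orthonormal columns of Q

f = lambda x: x @ A @ x                # the quadratic form f(x) = x^T A x

# At a unit-norm eigenvector, f equals the corresponding eigenvalue, so the
# extremes of f on the unit sphere are the extreme eigenvalues.
assert np.isclose(f(Q[:, -1]), w[-1])  # maximum of f subject to ||x||_2 = 1
assert np.isclose(f(Q[:, 0]), w[0])    # minimum of f subject to ||x||_2 = 1

# Definiteness can be read directly off the spectrum:
print("positive definite:    ", np.all(w > 0))
print("positive semidefinite:", np.all(w >= 0))
```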
2.8 Singular Value Decomposition

In Sec. 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However, the SVD is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
Recall that the eigendecomposition involves analyzing a matrix $A$ to discover a matrix $V$ of eigenvectors and a vector of eigenvalues $\lambda$ such that we can rewrite $A$ as

$$A = V \operatorname{diag}(\lambda)\, V^{-1}. \tag{2.42}$$

The singular value decomposition is similar, except this time we will write $A$ as a product of three matrices:

$$A = U D V^\top. \tag{2.43}$$
Suppose that $A$ is an $m \times n$ matrix. Then $U$ is defined to be an $m \times m$ matrix, $D$ to be an $m \times n$ matrix, and $V$ to be an $n \times n$ matrix.

Each of these matrices is defined to have a special structure. The matrices $U$ and $V$ are both defined to be orthogonal matrices. The matrix $D$ is defined to be a diagonal matrix. Note that $D$ is not necessarily square.

The elements along the diagonal of $D$ are known as the singular values of the matrix $A$. The columns of $U$ are known as the left-singular vectors. The columns of $V$ are known as the right-singular vectors.

We can actually interpret the singular value decomposition of $A$ in terms of the eigendecomposition of functions of $A$. The left-singular vectors of $A$ are the eigenvectors of $A A^\top$. The right-singular vectors of $A$ are the eigenvectors of $A^\top A$. The non-zero singular values of $A$ are the square roots of the eigenvalues of $A^\top A$. The same is true for $A A^\top$.

Perhaps the most useful feature of the SVD is that we can use it to partially generalize matrix inversion to non-square matrices, as we will see in the next section.
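The following short NumPy sketch (our illustration, not the book's) confirms the factorization and its relationship to the eigendecomposition of $A^\top A$ for a non-square matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # a non-square real matrix

U, s, Vt = np.linalg.svd(A)              # U: 5x5, s: singular values, Vt: V^T (3x3)

# Rebuild the m x n diagonal matrix D and check that A = U D V^T.
D = np.zeros((5, 3))
np.fill_diagonal(D, s)
assert np.allclose(A, U @ D @ Vt)

# The non-zero singular values are the square roots of the eigenvalues of
# A^T A (eigvalsh returns them in ascending order, so we reverse them).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
assert np.allclose(s, np.sqrt(np.clip(eigvals, 0, None)))
```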
2.9 The Moore-Penrose Pseudoinverse

Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse $B$ of a matrix $A$, so that we can solve a linear equation

$$A x = y \tag{2.44}$$
by left-multiplying each side to obtain

$$x = B y. \tag{2.45}$$

Depending on the structure of the problem, it may not be possible to design a unique mapping from $A$ to $B$.

If $A$ is taller than it is wide, then it is possible for this equation to have no solution. If $A$ is wider than it is tall, then there could be multiple possible solutions.

The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of $A$ is defined as a matrix

$$A^+ = \lim_{\alpha \searrow 0} \left( A^\top A + \alpha I \right)^{-1} A^\top. \tag{2.46}$$

Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula

$$A^+ = V D^+ U^\top, \tag{2.47}$$

where $U$, $D$ and $V$ are the singular value decomposition of $A$, and the pseudoinverse $D^+$ of a diagonal matrix $D$ is obtained by taking the reciprocal of its non-zero elements then taking the transpose of the resulting matrix.

When $A$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $x = A^+ y$ with minimal Euclidean norm $\|x\|_2$ among all possible solutions.

When $A$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $x$ for which $A x$ is as close as possible to $y$ in terms of Euclidean norm $\|A x - y\|_2$.
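In NumPy, `np.linalg.pinv` computes the Moore-Penrose pseudoinverse. The sketch below (ours, not the book's) reproduces it from Eq. 2.47 and checks the minimal-norm property for a wide matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))          # wider than tall: infinitely many solutions
y = rng.standard_normal(2)

# Pseudoinverse via Eq. 2.47: A+ = V D+ U^T, where D+ reciprocates the
# non-zero singular values of D and transposes the result.
U, s, Vt = np.linalg.svd(A)              # U: 2x2, s: length 2, Vt: 4x4
D_plus = np.zeros((4, 2))
D_plus[:2, :2] = np.diag(1.0 / s)
A_pinv = Vt.T @ D_plus @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# x = A+ y solves Ax = y and has minimal Euclidean norm among all solutions.
x = A_pinv @ y
assert np.allclose(A @ x, y)

# Shifting x along a null-space direction of A gives another solution,
# but one with a strictly larger norm.
x_other = x + Vt[3]                      # Vt[3] lies in the null space of A
assert np.allclose(A @ x_other, y)
assert np.linalg.norm(x_other) > np.linalg.norm(x)
```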
2.10 The Trace Operator

The trace operator gives the sum of all of the diagonal entries of a matrix:

$$\operatorname{Tr}(A) = \sum_i A_{i,i}. \tag{2.48}$$

The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

$$\|A\|_F = \sqrt{\operatorname{Tr}(A A^\top)}. \tag{2.49}$$

Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:

$$\operatorname{Tr}(A) = \operatorname{Tr}(A^\top). \tag{2.50}$$

The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

$$\operatorname{Tr}(ABC) = \operatorname{Tr}(CAB) = \operatorname{Tr}(BCA) \tag{2.51}$$

or more generally,

$$\operatorname{Tr}\left( \prod_{i=1}^{n} F^{(i)} \right) = \operatorname{Tr}\left( F^{(n)} \prod_{i=1}^{n-1} F^{(i)} \right). \tag{2.52}$$

This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, we have

$$\operatorname{Tr}(AB) = \operatorname{Tr}(BA) \tag{2.53}$$

even though $AB \in \mathbb{R}^{m \times m}$ and $BA \in \mathbb{R}^{n \times n}$.

Another useful fact to keep in mind is that a scalar is its own trace: $a = \operatorname{Tr}(a)$.
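These identities are easy to check numerically. The following sketch (ours, not from the text) verifies Eqs. 2.49 through 2.53 on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

tr = np.trace

# Eq. 2.49: the Frobenius norm written via the trace operator.
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(tr(A @ A.T)))

# Eq. 2.50: invariance to transposition (for a square matrix).
assert np.isclose(tr(C), tr(C.T))

# Eq. 2.51: invariance to cyclic permutation of the factors.
assert np.isclose(tr(A @ B @ C), tr(C @ A @ B))
assert np.isclose(tr(A @ B @ C), tr(B @ C @ A))

# Eq. 2.53: holds even though AB is 3x3 while BA is 4x4.
assert np.isclose(tr(A @ B), tr(B @ A))
```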
2.11 The Determinant

The determinant of a square matrix, denoted $\det(A)$, is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation is volume-preserving.
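A quick NumPy check (ours, not the text's) of the relationship between the determinant and the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# The determinant equals the product of the eigenvalues (which may come in
# complex-conjugate pairs for a general real matrix; their product is real).
eigvals = np.linalg.eigvals(A)
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)

# A singular matrix (here, a repeated column) collapses space along at least
# one dimension, so its determinant is 0.
A_singular = A.copy()
A_singular[:, 1] = A_singular[:, 0]
assert np.isclose(np.linalg.det(A_singular), 0.0)
```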
2.12 Example: Principal Components Analysis

One simple machine learning algorithm, principal components analysis or PCA, can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of $m$ points $\{x^{(1)}, \ldots, x^{(m)}\}$ in $\mathbb{R}^n$. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.

One way we can encode these points is to represent a lower-dimensional version of them. For each point $x^{(i)} \in \mathbb{R}^n$ we will find a corresponding code vector $c^{(i)} \in \mathbb{R}^l$. If $l$ is smaller than $n$, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, $f(x) = c$, and a decoding function that produces the reconstructed input given its code, $x \approx g(f(x))$.

PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into $\mathbb{R}^n$. Let $g(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$ is the matrix defining the decoding.

Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of $D$ to be orthogonal to each other. (Note that $D$ is still not technically "an orthogonal matrix" unless $l = n$.)

With the problem as described so far, many solutions are possible, because we can increase the scale of $D_{:,i}$ if we decrease $c_i$ proportionally for all points. To give the problem a unique solution, we constrain all of the columns of $D$ to have unit norm.

In order to turn this basic idea into an algorithm we can implement, the first thing we need to do is figure out how to generate the optimal code point $c^*$ for each input point $x$. One way to do this is to minimize the distance between the input point $x$ and its reconstruction, $g(c^*)$. We can measure this distance using a norm. In the principal components algorithm, we use the $L^2$ norm:

$$c^* = \arg\min_c \|x - g(c)\|_2. \tag{2.54}$$

We can switch to the squared $L^2$ norm instead of the $L^2$ norm itself, because both are minimized by the same value of $c$. This is because the $L^2$ norm is non-negative and the squaring operation is monotonically increasing for non-negative arguments.

$$c^* = \arg\min_c \|x - g(c)\|_2^2. \tag{2.55}$$

The function being minimized simplifies to

$$(x - g(c))^\top (x - g(c)) \tag{2.56}$$

(by the definition of the $L^2$ norm, Eq. 2.30)

$$= x^\top x - x^\top g(c) - g(c)^\top x + g(c)^\top g(c) \tag{2.57}$$

(by the distributive property)

$$= x^\top x - 2 x^\top g(c) + g(c)^\top g(c) \tag{2.58}$$

(because the scalar $g(c)^\top x$ is equal to the transpose of itself).

We can now change the function being minimized again, to omit the first term, since this term does not depend on $c$:

$$c^* = \arg\min_c -2 x^\top g(c) + g(c)^\top g(c). \tag{2.59}$$

To make further progress, we must substitute in the definition of $g(c)$:

$$c^* = \arg\min_c -2 x^\top D c + c^\top D^\top D c \tag{2.60}$$

$$= \arg\min_c -2 x^\top D c + c^\top I_l c \tag{2.61}$$

(by the orthogonality and unit norm constraints on $D$)

$$= \arg\min_c -2 x^\top D c + c^\top c. \tag{2.62}$$

We can solve this optimization problem using vector calculus (see Sec. 4.3 if you do not know how to do this):

$$\nabla_c \left( -2 x^\top D c + c^\top c \right) = 0 \tag{2.63}$$

$$-2 D^\top x + 2c = 0 \tag{2.64}$$

$$c = D^\top x. \tag{2.65}$$

This makes the algorithm efficient: we can optimally encode $x$ just using a matrix-vector operation. To encode a vector, we apply the encoder function

$$f(x) = D^\top x. \tag{2.66}$$

Using a further matrix multiplication, we can also define the PCA reconstruction operation:

$$r(x) = g(f(x)) = D D^\top x. \tag{2.67}$$

Next, we need to choose the encoding matrix $D$. To do so, we revisit the idea of minimizing the $L^2$ distance between inputs and reconstructions. However, since we will use the same matrix $D$ to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

$$D^* = \arg\min_D \sqrt{\sum_{i,j} \left( x_j^{(i)} - r(x^{(i)})_j \right)^2} \quad \text{subject to } D^\top D = I_l. \tag{2.68}$$

To derive the algorithm for finding $D^*$, we will start by considering the case where $l = 1$. In this case, $D$ is just a single vector, $d$. Substituting Eq. 2.67 into Eq. 2.68 and simplifying $D$ into $d$, the problem reduces to

$$d^* = \arg\min_d \sum_i \|x^{(i)} - d d^\top x^{(i)}\|_2^2 \quad \text{subject to } \|d\|_2 = 1. \tag{2.69}$$

The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value $d^\top x^{(i)}$ on the right of the vector $d$. It is more conventional to write scalar coefficients on the left of the vector they operate on. We therefore usually write such a formula as

$$d^* = \arg\min_d \sum_i \|x^{(i)} - d^\top x^{(i)}\, d\|_2^2 \quad \text{subject to } \|d\|_2 = 1, \tag{2.70}$$

or, exploiting the fact that a scalar is its own transpose, as

$$d^* = \arg\min_d \sum_i \|x^{(i)} - x^{(i)\top} d\, d\|_2^2 \quad \text{subject to } \|d\|_2 = 1. \tag{2.71}$$

The reader should aim to become familiar with such cosmetic rearrangements.

At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let $X \in \mathbb{R}^{m \times n}$ be the matrix defined by stacking all of the vectors describing the points, such that $X_{i,:} = x^{(i)\top}$. We can now rewrite the problem as

$$d^* = \arg\min_d \|X - X d d^\top\|_F^2 \quad \text{subject to } d^\top d = 1. \tag{2.72}$$

Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:

$$\arg\min_d \|X - X d d^\top\|_F^2 \tag{2.73}$$

$$= \arg\min_d \operatorname{Tr}\left( (X - X d d^\top)^\top (X - X d d^\top) \right) \tag{2.74}$$

(by Eq. 2.49)

$$= \arg\min_d \operatorname{Tr}( X^\top X - X^\top X d d^\top - d d^\top X^\top X + d d^\top X^\top X d d^\top ) \tag{2.75}$$

$$= \arg\min_d \operatorname{Tr}(X^\top X) - \operatorname{Tr}(X^\top X d d^\top) - \operatorname{Tr}(d d^\top X^\top X) + \operatorname{Tr}(d d^\top X^\top X d d^\top) \tag{2.76}$$

$$= \arg\min_d -\operatorname{Tr}(X^\top X d d^\top) - \operatorname{Tr}(d d^\top X^\top X) + \operatorname{Tr}(d d^\top X^\top X d d^\top) \tag{2.77}$$

(because terms not involving $d$ do not affect the $\arg\min$)

$$= \arg\min_d -2 \operatorname{Tr}(X^\top X d d^\top) + \operatorname{Tr}(d d^\top X^\top X d d^\top) \tag{2.78}$$

(because we can cycle the order of the matrices inside a trace, Eq. 2.52)

$$= \arg\min_d -2 \operatorname{Tr}(X^\top X d d^\top) + \operatorname{Tr}(X^\top X d d^\top d d^\top) \tag{2.79}$$

(using the same property again)

At this point, we re-introduce the constraint:

$$\arg\min_d -2 \operatorname{Tr}(X^\top X d d^\top) + \operatorname{Tr}(X^\top X d d^\top d d^\top) \quad \text{subject to } d^\top d = 1 \tag{2.80}$$

$$= \arg\min_d -2 \operatorname{Tr}(X^\top X d d^\top) + \operatorname{Tr}(X^\top X d d^\top) \quad \text{subject to } d^\top d = 1 \tag{2.81}$$

(due to the constraint)

$$= \arg\min_d -\operatorname{Tr}(X^\top X d d^\top) \quad \text{subject to } d^\top d = 1 \tag{2.82}$$

$$= \arg\max_d \operatorname{Tr}(X^\top X d d^\top) \quad \text{subject to } d^\top d = 1 \tag{2.83}$$

$$= \arg\max_d \operatorname{Tr}(d^\top X^\top X d) \quad \text{subject to } d^\top d = 1. \tag{2.84}$$

This optimization problem may be solved using eigendecomposition. Specifically, the optimal $d$ is given by the eigenvector of $X^\top X$ corresponding to the largest eigenvalue.

In the general case, where $l > 1$, the matrix $D$ is given by the $l$ eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction. We recommend writing this proof as an exercise.
Linear algebra is one of the fundamental mathematical disciplines that is necessary to understand deep learning. Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.
Chapter 3

Probability and Information Theory

In this chapter, we describe probability theory and information theory.

Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty and axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

Probability theory is a fundamental tool of many disciplines of science and engineering. We provide this chapter to ensure that readers whose background is primarily in software engineering with limited exposure to probability theory can understand the material in this book.

While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability distribution.

If you are already familiar with probability theory and information theory, you may wish to skip all of this chapter except for Sec. 3.14, which describes the graphs we use to describe structured probabilistic models for machine learning. If you have absolutely no prior experience with these subjects, this chapter should be sufficient to successfully carry out deep learning research projects, but we do suggest that you consult an additional resource, such as Jaynes (2003).
3.1 Why Probability?

Many branches of computer science deal mostly with entities that are entirely deterministic and certain. A programmer can usually safely assume that a CPU will execute each machine instruction flawlessly. Errors in hardware do occur, but are rare enough that most software applications do not need to be designed to account for them. Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory.

This is because machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities. Uncertainty and stochasticity can arise from many sources. Researchers have made compelling arguments for quantifying uncertainty using probability since at least the 1980s. Many of the arguments presented here are summarized from or inspired by Pearl (1988).

Nearly all activities require some ability to reason in the presence of uncertainty. In fact, beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur.

There are three possible sources of uncertainty:

1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.

2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.

3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.

In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule "Most birds fly" is cheap to develop and is broadly useful, while a rule of the form, "Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . . " is expensive to develop, maintain and communicate, and after all of this effort is still very brittle and prone to failure.

Given that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all of the tools we want for artificial intelligence applications. Probability theory was originally developed to analyze the frequencies of events. It is easy to see how probability theory can be used to study events like drawing a certain hand of cards in a game of poker. These kinds of events are often repeatable. When we say that an outcome has a probability $p$ of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards) infinitely many times, then proportion $p$ of the repetitions would result in that outcome. This kind of reasoning does not seem immediately applicable to propositions that are not repeatable. If a doctor analyzes a patient and says that the patient has a 40% chance of having the flu, this means something very different: we cannot make infinitely many replicas of the patient, nor is there any reason to believe that different replicas of the patient would present with the same symptoms yet have varying underlying conditions. In the case of the doctor diagnosing the patient, we use probability to represent a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient does not have the flu. The former kind of probability, related directly to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of certainty, is known as Bayesian probability.

If we list several properties that we expect common sense reasoning about uncertainty to have, then the only way to satisfy those properties is to treat Bayesian probabilities as behaving exactly the same as frequentist probabilities. For example, if we want to compute the probability that a player will win a poker game given that she has a certain set of cards, we use exactly the same formulas as when we compute the probability that a patient has a disease given that she has certain symptoms. For more details about why a small set of common sense assumptions implies that the same axioms must control both kinds of probability, see Ramsey (1926).

Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables

A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lower case letter in plain typeface, and the values it can take on with lower case script letters. For example, $x_1$ and $x_2$ are both possible values that the random variable $\mathrm{x}$ can take on. For vector-valued variables, we would write the random variable as $\mathbf{x}$ and one of its values as $x$. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are.

Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value. A continuous random variable is associated with a real value.
3.3 Probability Distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions

A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital $P$. Often we associate each random variable with a different probability mass function and the reader must infer which probability mass function to use based on the identity of the random variable, rather than the name of the function; $P(\mathrm{x})$ is usually not the same as $P(\mathrm{y})$.

The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that $\mathrm{x} = x$ is denoted as $P(x)$, with a probability of 1 indicating that $\mathrm{x} = x$ is certain and a probability of 0 indicating that $\mathrm{x} = x$ is impossible. Sometimes to disambiguate which PMF to use, we write the name of the random variable explicitly: $P(\mathrm{x} = x)$. Sometimes we define a variable first, then use $\sim$ notation to specify which distribution it follows later: $\mathrm{x} \sim P(\mathrm{x})$.

Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. $P(\mathrm{x} = x, \mathrm{y} = y)$ denotes the probability that $\mathrm{x} = x$ and $\mathrm{y} = y$ simultaneously. We may also write $P(x, y)$ for brevity.

To be a probability mass function on a random variable $\mathrm{x}$, a function $P$ must satisfy the following properties:

• The domain of $P$ must be the set of all possible states of $\mathrm{x}$.

• $\forall x \in \mathrm{x},\; 0 \le P(x) \le 1$. An impossible event has probability 0 and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

• $\sum_{x \in \mathrm{x}} P(x) = 1$. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

For example, consider a single discrete random variable $\mathrm{x}$ with $k$ different states. We can place a uniform distribution on $\mathrm{x}$, that is, make each of its states equally likely, by setting its probability mass function to

$$P(\mathrm{x} = x_i) = \frac{1}{k} \tag{3.1}$$

for all $i$. We can see that this fits the requirements for a probability mass function. The value $\frac{1}{k}$ is positive because $k$ is a positive integer. We also see that

$$\sum_i P(\mathrm{x} = x_i) = \sum_i \frac{1}{k} = \frac{k}{k} = 1, \tag{3.2}$$

so the distribution is properly normalized.
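As a minimal numerical illustration (ours, not the text's), the uniform PMF over $k$ states satisfies all three properties:

```python
import numpy as np

k = 6                                    # e.g., a fair six-sided die
pmf = np.full(k, 1.0 / k)                # P(x = x_i) = 1/k for every state

assert np.all((pmf >= 0) & (pmf <= 1))   # every probability lies in [0, 1]
assert np.isclose(pmf.sum(), 1.0)        # normalized: sum_i 1/k = k/k = 1
```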
3.3.2 Con
3.3.2 Continuous Variables and Probability Density Functions

When working with continuous random variables, we describe probability dis-
tributions using a probability density function (PDF) rather than a probability
mass function. To be a probability density function, a function p must satisfy the
following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• ∫ p(x)dx = 1.

A probability density function p(x) does not give the probability of a specific
state directly; instead, the probability of landing inside an infinitesimal region with
volume δx is given by p(x)δx.

We can integrate the density function to find the actual probability mass of a
set of points. Specifically, the probability that x lies in some set S is given by the
integral of p(x) over that set. In the univariate example, the probability that x
lies in the interval [a, b] is given by ∫_{[a,b]} p(x)dx.

For an example of a probability density function corresponding to a specific
probability density over a continuous random variable, consider a uniform distribu-
tion on an interval of the real numbers. We can do this with a function u(x; a, b),
where a and b are the endpoints of the interval, with b > a. The “;” notation means
“parametrized by”; we consider x to be the argument of the function, while a and
b are parameters that define the function. To ensure that there is no probability
mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b],
u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally,
it integrates to 1. We often denote that x follows the uniform distribution on [a, b]
by writing x ∼ U(a, b).
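These properties are easy to check numerically. Below is a minimal NumPy sketch (an illustration, not part of the book's text) of the uniform density u(x; a, b); a trapezoid-rule sum confirms that it is nonnegative and integrates to approximately 1.

```python
import numpy as np

def uniform_pdf(x, a, b):
    """u(x; a, b): equal to 1/(b - a) inside [a, b] and 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

a, b = -1.0, 3.0
grid = np.linspace(a - 1.0, b + 1.0, 100001)
density = uniform_pdf(grid, a, b)
print(density.min() >= 0.0)        # True: nonnegative everywhere
print(np.trapz(density, grid))     # ~1.0: integrates to 1
```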
3.4 Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want
to know the probability distribution over just a subset of them. The probability
distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know
P(x, y). We can find P(x) with the sum rule:

∀x ∈ x, P(x = x) = Σ_y P(x = x, y = y).   (3.3)

The name “marginal probability” comes from the process of computing marginal
probabilities on paper. When the values of P(x, y) are written in a grid with
different values of x in rows and different values of y in columns, it is natural to
sum across a row of the grid, then write P(x) in the margin of the paper just to
the right of the row.

For continuous variables, we need to use integration instead of summation:

p(x) = ∫ p(x, y)dy.   (3.4)
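As a concrete sketch of the sum rule (with made-up probabilities, purely for illustration), we can store the joint distribution of two discrete variables as a grid and marginalize by summing across the rows, mirroring the on-paper procedure described above.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index values of x,
# columns index values of y. The entries are invented for illustration.
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

P_x = P_xy.sum(axis=1)  # sum across each row: the marginal "in the margin"
P_y = P_xy.sum(axis=0)  # summing down the columns gives the other marginal
print(P_x)              # [0.3 0.7]
print(P_y)              # [0.4 0.6]
```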


3.5 Conditional Probability

In many cases, we are interested in the probability of some event, given that some
other event has happened. This is called a conditional probability. We denote
the conditional probability that y = y given x = x as P(y = y | x = x). This
conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).   (3.5)

The conditional probability is only defined when P(x = x) > 0. We cannot compute
the conditional probability conditioned on an event that never happens.

It is important not to confuse conditional probability with computing what
would happen if some action were undertaken. The conditional probability that
a person is from Germany given that they speak German is quite high, but if
a randomly selected person is taught to speak German, their country of origin
does not change. Computing the consequences of an action is called making an
intervention query. Intervention queries are the domain of causal modeling, which
we do not explore in this book.
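Using the same hypothetical joint table as in the marginalization sketch above, Eq. 3.5 amounts to dividing each row of P(x, y) by the corresponding marginal P(x = x); a minimal sketch:

```python
import numpy as np

P_xy = np.array([[0.10, 0.20],    # the same invented joint P(x, y) as before
                 [0.30, 0.40]])
P_x = P_xy.sum(axis=1)

# P(y | x): divide each row by P(x = x). Only defined where P(x = x) > 0.
P_y_given_x = P_xy / P_x[:, None]
print(P_y_given_x)                # each row is a conditional distribution
print(P_y_given_x.sum(axis=1))    # [1. 1.]: each row sums to 1
```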
3.6 The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables may be decomposed
into conditional distributions over only one variable:

P(x^(1), . . . , x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), . . . , x^(i−1)).   (3.6)

This observation is known as the chain rule or product rule of probability. It
follows immediately from the definition of conditional probability in Eq. 3.5. For
example, applying the definition twice, we get

P(a, b, c) = P(a | b, c)P(b, c)
P(b, c) = P(b | c)P(c)
P(a, b, c) = P(a | b, c)P(b | c)P(c).
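The factorization can be verified numerically on a small joint table. The sketch below (with a randomly generated, hypothetical joint distribution) computes each factor from the derivation above and checks that their product reproduces P(a, b, c).

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                      # a hypothetical joint P(a, b, c)

P_bc = P.sum(axis=0)              # P(b, c), marginalizing out a
P_c = P_bc.sum(axis=0)            # P(c)
P_a_given_bc = P / P_bc           # P(a | b, c), by Eq. 3.5
P_b_given_c = P_bc / P_c          # P(b | c)

print(np.allclose(P_a_given_bc * P_b_given_c * P_c, P))  # True
```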
3.7 Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can
be expressed as a product of two factors, one involving only x and one involving
only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y).   (3.7)

Two random variables x and y are conditionally independent given a random
variable z if the conditional probability distribution over x and y factorizes in this
way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z).   (3.8)

We can denote independence and conditional independence with compact
notation: x⊥y means that x and y are independent, while x⊥y | z means that x
and y are conditionally independent given z.
3.8 Expectation, Variance and Covariance

The expectation or expected value of some function f(x) with respect to a probability
distribution P(x) is the average or mean value that f takes on when x is drawn
from P. For discrete variables this can be computed with a summation:

E_{x∼P}[f(x)] = Σ_x P(x)f(x),   (3.9)

while for continuous variables, it is computed with an integral:

E_{x∼p}[f(x)] = ∫ p(x)f(x)dx.   (3.10)
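In practice the integral in Eq. 3.10 is often approximated by a Monte Carlo average over samples drawn from p; a quick illustrative sketch (not part of the book's text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E_{x~p}[f(x)] for f(x) = x^2 and p = standard normal.
x = rng.standard_normal(1_000_000)
print(np.square(x).mean())   # ~1.0, since E[x^2] = 1 for a standard normal
```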
When the identity of the distribution is clear from the context, we may simply
write the name of the random variable that the expectation is over, as in E_x[f(x)].
If it is clear which random variable the expectation is over, we may omit the
subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over
the values of all the random variables inside the brackets. Likewise, when there is
no ambiguity, we may omit the square brackets.

Expectations are linear, for example,

E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)],   (3.11)

when α and β are not dependent on x.

The variance gives a measure of how much the values of a function of a random
variable x vary as we sample different values of x from its probability distribution:

Var(f(x)) = E[(f(x) − E[f(x)])²].   (3.12)

When the variance is low, the values of f(x) cluster near their expected value. The
square root of the variance is known as the standard deviation.

The covariance gives some sense of how much two values are linearly related to
each other, as well as the scale of these variables:

Cov(f(x), g(y)) = E[(f(x) − E[f(x)])(g(y) − E[g(y)])].   (3.13)

High absolute values of the covariance mean that the values change very much
and are both far from their respective means at the same time. If the sign of the
covariance is positive, then both variables tend to take on relatively high values
simultaneously. If the sign of the covariance is negative, then one variable tends to
take on a relatively high value at the times that the other takes on a relatively low
value and vice versa. Other measures such as correlation normalize the contribution
of each variable in order to measure only how much the variables are related, rather
than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct
concepts. They are related because two variables that are independent have zero
covariance, and two variables that have non-zero covariance are dependent. How-
ever, independence is a distinct property from covariance. For two variables to have
zero covariance, there must be no linear dependence between them. Independence
is a stronger requirement than zero covariance, because independence also excludes
nonlinear relationships. It is possible for two variables to be dependent but have
zero covariance. For example, suppose we first sample a real number x from a
uniform distribution over the interval [−1, 1]. We next sample a random variable
s. With probability 1/2, we choose the value of s to be 1. Otherwise, we choose
the value of s to be −1. We can then generate a random variable y by assigning
y = sx. Clearly, x and y are not independent, because x completely determines
the magnitude of y. However, Cov(x, y) = 0.

The covariance matrix of a random vector x ∈ Rⁿ is an n × n matrix, such that

Cov(x)_{i,j} = Cov(x_i, x_j).   (3.14)

The diagonal elements of the covariance give the variance:

Cov(x_i, x_i) = Var(x_i).   (3.15)
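A short simulation (an illustrative sketch, not from the book) reproduces the example above: y = sx is dependent on x, since |y| = |x|, yet its empirical covariance with x is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(-1.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)             # random sign, probability 1/2 each
y = s * x

print(np.cov(x, y)[0, 1])                       # ~0.0: no linear relationship
print(np.corrcoef(np.abs(x), np.abs(y))[0, 1])  # 1.0: clearly dependent
```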
3.9 Common Probability Distributions

Several simple probability distributions are useful in many contexts in machine
learning.
3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable.
It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the
random variable being equal to 1. It has the following properties:

P(x = 1) = φ   (3.16)
P(x = 0) = 1 − φ   (3.17)
P(x = x) = φ^x (1 − φ)^{1−x}   (3.18)
E_x[x] = φ   (3.19)
Var_x(x) = φ(1 − φ)   (3.20)
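A minimal sketch checking Eqs. 3.19 and 3.20 by simulation, with an arbitrary choice of φ:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.3                                         # arbitrary parameter in [0, 1]
x = (rng.random(1_000_000) < phi).astype(float)   # Bernoulli(phi) samples

print(x.mean())   # ~0.30 = phi             (Eq. 3.19)
print(x.var())    # ~0.21 = phi * (1 - phi) (Eq. 3.20)
```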
3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete
variable with k different states, where k is finite.¹ The multinoulli distribution is
parametrized by a vector p ∈ [0, 1]^{k−1}, where p_i gives the probability of the i-th
state. The final, k-th state’s probability is given by 1 − 1⊤p. Note that we must
constrain 1⊤p ≤ 1. Multinoulli distributions are often used to refer to distributions
over categories of objects, so we do not usually assume that state 1 has numerical
value 1, etc. For this reason, we do not usually need to compute the expectation
or variance of multinoulli-distributed random variables.

The Bernoulli and multinoulli distributions are sufficient to describe any distri-
bution over their domain. This is because they model discrete variables for which
it is feasible to simply enumerate all of the states. When dealing with continuous
variables, there are uncountably many states, so any distribution described by a
small number of parameters must impose strict limits on the distribution.

¹ “Multinoulli” is a term that was recently coined by Gustavo Lacerdo and popularized by
Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A
multinomial distribution is the distribution over vectors in {0, . . . , n}^k representing how many
times each of the k categories is visited when n samples are drawn from a multinoulli distribution.
Many texts use the term “multinomial” to refer to multinoulli distributions without clarifying
that they refer only to the n = 1 case.
3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution,
also known as the Gaussian distribution:

N(x; µ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²)) (x − µ)²).   (3.21)

See Fig. 3.1 for a plot of the density function.

The two parameters µ ∈ R and σ ∈ (0, ∞) control the normal distribution.
The parameter µ gives the coordinate of the central peak. This is also the mean of
the distribution: E[x] = µ. The standard deviation of the distribution is given by
σ, and the variance by σ².

When we evaluate the PDF, we need to square and invert σ. When we need to
frequently evaluate the PDF with different parameter values, a more efficient way
of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the
precision or inverse variance of the distribution:

N(x; µ, β⁻¹) = √(β/(2π)) exp(−(1/2) β (x − µ)²).   (3.22)

Normal distributions are a sensible choice for many applications. In the absence
of prior knowledge about what form a distribution over the real numbers should
take, the normal distribution is a good default choice for two major reasons.

First, many distributions we wish to model are truly close to being normal
distributions. The central limit theorem shows that the sum of many independent
random variables is approximately normally distributed. This means that in
practice, many complicated systems can be modeled successfully as normally
distributed noise, even if the system can be decomposed into parts with more
structured behavior.

[Figure 3.1 (plot omitted): the density p(x) has its maximum at x = µ and inflection points at x = µ ± σ.]

Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ²) exhibits a classic
“bell curve” shape, with the x coordinate of its central peak given by µ, and the width
of its peak controlled by σ. In this example, we depict the standard normal distribution,
with µ = 0 and σ = 1.
Second, out of all possible probability distributions with the same variance,
the normal distribution encodes the maximum amount of uncertainty over the
real numbers. We can thus think of the normal distribution as being the one that
inserts the least amount of prior knowledge into a model. Fully developing and
justifying this idea requires more mathematical tools, and is postponed to Sec.
19.4.2.
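The two parametrizations in Eqs. 3.21 and 3.22 are easy to compare numerically; a sketch (illustrative only) that checks they agree when β = 1/σ²:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2) as written in Eq. 3.21."""
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """The same density parametrized by the precision beta = 1/sigma^2 (Eq. 3.22)."""
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x = np.linspace(-5.0, 5.0, 11)
print(np.allclose(normal_pdf(x, 0.0, 4.0),
                  normal_pdf_precision(x, 0.0, 0.25)))   # True
```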
The normal distribution generalizes to Rⁿ, in which case it is known as the
multivariate normal distribution. It may be parametrized with a positive definite
symmetric matrix Σ:

N(x; µ, Σ) = √(1/((2π)ⁿ det(Σ))) exp(−(1/2) (x − µ)⊤ Σ⁻¹ (x − µ)).   (3.23)

The parameter µ still gives the mean of the distribution, though now it is
vector-valued. The parameter Σ gives the covariance matrix of the distribution.
As in the univariate case, when we wish to evaluate the PDF several times for
many different values of the parameters, the covariance is not a computationally
efficient way to parametrize the distribution, since we need to invert Σ to evaluate
the PDF. We can instead use a precision matrix β:

N(x; µ, β⁻¹) = √(det(β)/(2π)ⁿ) exp(−(1/2) (x − µ)⊤ β (x − µ)).   (3.24)

We often fix the covariance matrix to be a diagonal matrix. An even simpler
version is the isotropic Gaussian distribution, whose covariance matrix is a scalar
times the identity matrix.
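A direct sketch of Eq. 3.23 for small n (illustrative only; production code would prefer a Cholesky factorization and log-space computation for numerical stability):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x; mu, Sigma) as in Eq. 3.23, for a single point x in R^n."""
    n = mu.shape[0]
    diff = x - mu
    norm = np.sqrt(1.0 / ((2.0 * np.pi) ** n * np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])        # symmetric, positive definite
print(mvn_pdf(np.array([0.1, -0.2]), mu, Sigma))
```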
3.9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution
with a sharp point at x = 0. To accomplish this, we can use the exponential
distribution:

p(x; λ) = λ 1_{x≥0} exp(−λx).   (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability
zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak
of probability mass at an arbitrary point µ is the Laplace distribution

Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ).   (3.26)
3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all of the mass in a probability distribution
clusters around a single point. This can be accomplished by defining a PDF using
the Dirac delta function, δ(x):

p(x) = δ(x − µ).   (3.27)

The Dirac delta function is defined such that it is zero-valued everywhere except
0, yet integrates to 1. The Dirac delta function is not an ordinary function that
associates each value x with a real-valued output; instead it is a different kind of
mathematical object called a generalized function that is defined in terms of its
properties when integrated. We can think of the Dirac delta function as being the
limit point of a series of functions that put less and less mass on all points other
than µ.
By defining p(x) to be δ shifted by −µ we obtain an infinitely narrow and
infinitely high peak of probability mass where x = µ.

A common use of the Dirac delta distribution is as a component of an empirical
distribution,

p̂(x) = (1/m) Σ_{i=1}^{m} δ(x − x^(i))   (3.28)

which puts probability mass 1/m on each of the m points x^(1), . . . , x^(m) forming
a given data set or collection of samples. The Dirac delta distribution is only
necessary to define the empirical distribution over continuous variables. For discrete
variables, the situation is simpler: an empirical distribution can be conceptualized
as a multinoulli distribution, with a probability associated to each possible input
value that is simply equal to the empirical frequency of that value in the training
set.

We can view the empirical distribution formed from a dataset of training
examples as specifying the distribution that we sample from when we train a model
on this dataset. Another important perspective on the empirical distribution is
that it is the probability density that maximizes the likelihood of the training data
(see Sec. 5.5).
3.9.6 Mixtures of Distributions

It is also common to define probability distributions by combining other simpler
probability distributions. One common way of combining distributions is to
construct a mixture distribution. A mixture distribution is made up of several
component distributions. On each trial, the choice of which component distribution
generates the sample is determined by sampling a component identity from a
multinoulli distribution:

P(x) = Σ_i P(c = i) P(x | c = i)   (3.29)

where P(c) is the multinoulli distribution over component identities.

We have already seen one example of a mixture distribution: the empirical
distribution over real-valued variables is a mixture distribution with one Dirac
component for each training example.

The mixture model is one simple strategy for combining probability distributions
to create a richer distribution. In Chapter 16, we explore the art of building complex
probability distributions from simple ones in more detail.
The mixture model allows us to briefly glimpse a concept that will be of
paramount importance later: the latent variable. A latent variable is a random
variable that we cannot observe directly. The component identity variable c of the
mixture model provides an example. Latent variables may be related to x through
the joint distribution, in this case, P(x, c) = P(x | c)P(c). The distribution P(c)
over the latent variable and the distribution P(x | c) relating the latent variables
to the visible variables determines the shape of the distribution P(x) even though
it is possible to describe P(x) without reference to the latent variable. Latent
variables are discussed further in Sec. 16.5.

A very powerful and common type of mixture model is the Gaussian mixture
model, in which the components p(x | c = i) are Gaussians. Each component has
a separately parametrized mean µ^(i) and covariance Σ^(i). Some mixtures can have
more constraints. For example, the covariances could be shared across components
via the constraint Σ^(i) = Σ, ∀i. As with a single Gaussian distribution, the mixture
of Gaussians might constrain the covariance matrix for each component to be
diagonal or isotropic.

In addition to the means and covariances, the parameters of a Gaussian mixture
specify the prior probability α_i = P(c = i) given to each component i. The word
“prior” indicates that it expresses the model’s beliefs about c before it has observed
x. By comparison, P(c | x) is a posterior probability, because it is computed after
observation of x. A Gaussian mixture model is a universal approximator of
densities, in the sense that any smooth density can be approximated with any
specific, non-zero amount of error by a Gaussian mixture model with enough
components.

Fig. 3.2 shows samples from a Gaussian mixture model.
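Sampling from a mixture follows Eq. 3.29 directly: first draw the latent component identity c from the multinoulli prior, then draw x from the chosen component. A minimal one-dimensional sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([0.5, 0.3, 0.2])   # hypothetical prior P(c = i)
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_gmm(n):
    c = rng.choice(len(alpha), size=n, p=alpha)   # latent component identity
    return mu[c] + sigma[c] * rng.standard_normal(n)

samples = sample_gmm(100_000)
print(samples.mean())   # ~ alpha @ mu = -0.4, the mean of the mixture
```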
3.10 Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, especially
the probability distributions used in deep learning models.

One of these functions is the logistic sigmoid:

σ(x) = 1 / (1 + exp(−x)).   (3.30)

The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli
distribution because its range is (0, 1), which lies within the valid range of values
for the φ parameter. See Fig. 3.3 for a graph of the sigmoid function. The sigmoid
function saturates when its argument is very positive or very negative, meaning
that the function becomes very flat and insensitive to small changes in its input.
[Figure 3.2 (scatter plots omitted; the panels show samples over the x₁ and x₂ axes).]

Figure 3.2: Samples from a Gaussian mixture model. In this example, there are three
components. From left to right, the first component has an isotropic covariance matrix,
meaning it has the same amount of variance in each direction. The second has a diagonal
covariance matrix, meaning it can control the variance separately along each axis-aligned
direction. This example has more variance along the x₂ axis than along the x₁ axis. The
third component has a full-rank covariance matrix, allowing it to control the variance
separately along an arbitrary basis of directions.
Another commonly encountered function is the softplus function (Dugas et al.,
2001):

ζ(x) = log(1 + exp(x)).   (3.31)

The softplus function can be useful for producing the β or σ parameter of a normal
distribution because its range is (0, ∞). It also arises commonly when manipulating
expressions involving sigmoids. The name of the softplus function comes from the
fact that it is a smoothed or “softened” version of

x⁺ = max(0, x).   (3.32)

See Fig. 3.4 for a graph of the softplus function.
[Figure 3.3 (plot omitted): σ(x) rises smoothly from 0 to 1 as x goes from −10 to 10.]

Figure 3.3: The logistic sigmoid function.

[Figure 3.4 (plot omitted): ζ(x) is near 0 for negative x and grows approximately linearly for positive x.]

Figure 3.4: The softplus function.
The following properties are all useful enough that you may wish to memorize
them:

σ(x) = exp(x) / (exp(x) + exp(0))   (3.33)

d/dx σ(x) = σ(x)(1 − σ(x))   (3.34)

1 − σ(x) = σ(−x)   (3.35)

log σ(x) = −ζ(−x)   (3.36)

d/dx ζ(x) = σ(x)   (3.37)

∀x ∈ (0, 1), σ⁻¹(x) = log(x / (1 − x))   (3.38)

∀x > 0, ζ⁻¹(x) = log(exp(x) − 1)   (3.39)

ζ(x) = ∫_{−∞}^{x} σ(y)dy   (3.40)

ζ(x) − ζ(−x) = x   (3.41)

The function σ⁻¹(x) is called the logit in statistics, but this term is more rarely
used in machine learning.

Eq. 3.41 provides extra justification for the name “softplus.” The softplus
function is intended as a smoothed version of the positive part function, x⁺ =
max{0, x}. The positive part function is the counterpart of the negative part
function, x⁻ = max{0, −x}. To obtain a smooth function that is analogous to the
negative part, one can use ζ(−x). Just as x can be recovered from its positive part
and negative part via the identity x⁺ − x⁻ = x, it is also possible to recover x
using the same relationship between ζ(x) and ζ(−x), as shown in Eq. 3.41.
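These identities are straightforward to check numerically. The sketch below (illustrative only; for large |x|, careful implementations compute these quantities in more numerically stable forms) verifies Eqs. 3.35, 3.37 and 3.41.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # Eq. 3.30

def softplus(x):
    return np.log1p(np.exp(x))          # Eq. 3.31, using log1p for accuracy

x = np.linspace(-5.0, 5.0, 101)
print(np.allclose(1.0 - sigmoid(x), sigmoid(-x)))     # Eq. 3.35
print(np.allclose(softplus(x) - softplus(-x), x))     # Eq. 3.41

# Eq. 3.37: the derivative of softplus is the sigmoid (centered difference check).
h = 1e-5
deriv = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
print(np.allclose(deriv, sigmoid(x), atol=1e-8))
```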
3.11 Bayes’ Rule

We often find ourselves in a situation where we know P(y | x) and need to know
P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity
using Bayes’ rule:

P(x | y) = P(x)P(y | x) / P(y).   (3.42)

Note that while P(y) appears in the formula, it is usually feasible to compute
P(y) = Σ_x P(y | x)P(x), so we do not need to begin with knowledge of P(y).

Bayes’ rule is straightforward to derive from the definition of conditional
probability, but it is useful to know the name of this formula since many texts
refer to it by name. It is named after the Reverend Thomas Bayes, who first
discovered a special case of the formula. The general version presented here was
independently discovered by Pierre-Simon Laplace.
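A tiny numerical sketch of Eq. 3.42 with invented numbers, computing P(y) by the marginalization above rather than assuming it is known:

```python
import numpy as np

P_x = np.array([0.8, 0.2])             # hypothetical prior P(x)
P_y_given_x = np.array([[0.9, 0.1],    # row i holds P(y | x = i)
                        [0.3, 0.7]])

P_y = P_y_given_x.T @ P_x              # P(y) = sum_x P(y | x) P(x)
P_x_given_y = P_x[:, None] * P_y_given_x / P_y   # Bayes' rule, Eq. 3.42
print(P_x_given_y)
print(P_x_given_y.sum(axis=0))         # [1. 1.]: each posterior sums to 1
```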
3.12 Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability
density functions requires developing probability theory in terms of a branch of
mathematics known as measure theory. Measure theory is beyond the scope of
this textbook, but we can briefly sketch some of the issues that measure theory is
employed to resolve.

In Sec. 3.3.2, we saw that the probability of a continuous vector-valued x lying
in some set S is given by the integral of p(x) over the set S. Some choices of set S
can produce paradoxes. For example, it is possible to construct two sets S₁ and
S₂ such that p(x ∈ S₁) + p(x ∈ S₂) > 1 but S₁ ∩ S₂ = ∅.² These sets are generally
constructed making very heavy use of the infinite precision of real numbers, for
example by making fractal-shaped sets or sets that are defined by transforming
the set of rational numbers. One of the key contributions of measure theory is to
provide a characterization of the set of sets that we can compute the probability
of without encountering paradoxes. In this book, we only integrate over sets with
relatively simple descriptions, so this aspect of measure theory never becomes a
relevant concern.

For our purposes, measure theory is more useful for describing theorems that
apply to most points in Rⁿ but do not apply to some corner cases. Measure theory
provides a rigorous way of describing that a set of points is negligibly small. Such
a set is said to have “measure zero.” We do not formally define this concept in this
textbook. However, it is useful to understand the intuition that a set of measure
zero occupies no volume in the space we are measuring. For example, within R², a
line has measure zero, while a filled polygon has positive measure. Likewise, an
individual point has measure zero. Any union of countably many sets that each
have measure zero also has measure zero (so the set of all the rational numbers
has measure zero, for instance).

Another useful term from measure theory is “almost everywhere.” A property
that holds almost everywhere holds throughout all of space except for on a set of
measure zero. Because the exceptions occupy a negligible amount of space, they
can be safely ignored for many applications. Some important results in probability
theory hold for all discrete values but only hold “almost everywhere” for continuous
values.

Another technical detail of continuous variables relates to handling continuous
random variables that are deterministic functions of one another. Suppose we have
two random variables, x and y, such that y = g(x), where g is an invertible, con-
tinuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)).
This is actually not the case.

As a simple example, suppose we have scalar random variables x and y. Suppose
y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y) then p_y will be 0
everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫ p_y(y)dy = 1/2,   (3.43)

which violates the definition of a probability distribution.

This common mistake is wrong because it fails to account for the distortion
of space introduced by the function g. Recall that the probability of x lying in
an infinitesimally small region with volume δx is given by p(x)δx. Since g can
expand or contract space, the infinitesimal volume surrounding x in x space may
have different volume in y space.

To see how to correct the problem, we return to the scalar case. We need to
preserve the property

|p_y(g(x))dy| = |p_x(x)dx|.   (3.44)

Solving from this, we obtain

p_y(y) = p_x(g⁻¹(y)) |∂x/∂y|   (3.45)

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|.   (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian
matrix, the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

p_x(x) = p_y(g(x)) |det(∂g(x)/∂x)|.   (3.47)

² The Banach-Tarski theorem provides a fun example of such sets.
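The correction in Eq. 3.45 can be confirmed by simulation. Using the y = x/2 example (a sketch, not from the book), a histogram of y has height approximately 2 on [0, 1/2]: the naive value p_x(2y) = 1 multiplied by |dx/dy| = 2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = x / 2.0

# Histogram-based estimate of the density of y on [0, 1/2]:
hist, edges = np.histogram(y, bins=50, range=(0.0, 0.5), density=True)
print(hist.mean())   # ~2.0, matching p_x(2y) * |dx/dy| rather than p_x(2y)
```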
3.13 Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally invented to study sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. In this context, information theory tells how to design optimal codes and calculate the expected length of messages sampled from specific probability distributions using various encoding schemes. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply. This field is fundamental to many areas of electrical engineering and computer science. In this textbook, we mostly use a few key ideas from information theory to characterize probability distributions or quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative.

We would like to quantify information in a way that formalizes this intuition. Specifically,
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
In order to satisfy all three of these properties, we define the self-information of an event x = x to be

    I(x) = −log P(x).    (3.48)

In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.
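Both units are one line of arithmetic apart; a small sketch (our own illustration, not code from the book):

    import numpy as np

    def self_information_nats(p):
        """I(x) = -log P(x) in nats (natural logarithm, Eq. 3.48)."""
        return -np.log(p)

    print(self_information_nats(1 / np.e))          # 1.0: one nat, by definition
    print(self_information_nats(0.5) / np.log(2))   # 1.0: one bit (a fair coin flip)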
When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
[Figure 3.5 shows a plot of the Shannon entropy of a binary random variable, in nats, as a function of p ∈ [0, 1].]

Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy. On the horizontal axis, we plot p, the probability of a binary random variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.
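The curve in Fig. 3.5 is easy to reproduce; a sketch of ours (function names are our own), using the 0 log 0 = 0 convention discussed at the end of this section:

    import numpy as np

    def binary_entropy_nats(p):
        """Shannon entropy of a Bernoulli(p) variable in nats (the curve of Fig. 3.5)."""
        p = np.asarray(p, dtype=float)
        h = np.zeros_like(p)
        for q in (p, 1.0 - p):
            nz = q > 0                     # convention: 0 log 0 = 0
            h[nz] -= q[nz] * np.log(q[nz])
        return h

    print(binary_entropy_nats(np.array([0.0, 0.1, 0.5, 0.9, 1.0])))
    # Zero at p = 0 and p = 1; maximal at p = 0.5, where it equals log 2 ≈ 0.693.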
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

    H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],    (3.49)

also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See Fig. 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

    D_KL(P∥Q) = E_{x∼P}[log (P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)].    (3.50)
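For discrete distributions, Eq. (3.50) is a short sum. A minimal sketch (ours; it assumes P and Q are arrays of probabilities over the same finite support, with Q(x) > 0 wherever P(x) > 0):

    import numpy as np

    def kl_divergence_nats(p, q):
        """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), Eq. (3.50)."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        nz = p > 0                          # terms with P(x) = 0 contribute 0
        return float(np.sum(p[nz] * (np.log(p[nz]) - np.log(q[nz]))))

    p = np.array([0.5, 0.25, 0.25])
    q = np.array([1/3, 1/3, 1/3])
    print(kl_divergence_nats(p, q))         # non-negative
    print(kl_divergence_nats(q, p))         # generally different: KL is asymmetric
    print(kl_divergence_nats(p, p))         # 0, since the distributions are equal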

 
In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base 2 logarithm, but in machine learning we usually use nats and the natural logarithm) needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

The KL divergence has many useful properties, most notably that it is non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal "almost everywhere" in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure because it is not symmetric: D_KL(P∥Q) ≠ D_KL(Q∥P) for some P and Q. This asymmetry means that there are important consequences to the choice of whether to use D_KL(P∥Q) or D_KL(Q∥P). See Fig. 3.6 for more detail.

A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + D_KL(P∥Q), which is similar to the KL divergence but lacking the term on the left:

    H(P, Q) = −E_{x∼P} log Q(x).    (3.51)

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.

When computing many of these quantities, it is common to encounter expressions of the form 0 log 0. By convention, in the context of information theory, we treat these expressions as lim_{x→0} x log x = 0.
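A quick numerical check of the identity H(P, Q) = H(P) + D_KL(P∥Q) (a sketch of ours that pairs with the KL helper above):

    import numpy as np

    def entropy_nats(p):
        p = np.asarray(p, dtype=float)
        nz = p > 0                          # convention: 0 log 0 = 0
        return float(-np.sum(p[nz] * np.log(p[nz])))

    def cross_entropy_nats(p, q):
        """H(P, Q) = -E_{x~P} log Q(x), Eq. (3.51)."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        nz = p > 0
        return float(-np.sum(p[nz] * np.log(q[nz])))

    p = np.array([0.5, 0.25, 0.25])
    q = np.array([0.25, 0.25, 0.5])
    # The difference H(P, Q) - H(P) recovers D_KL(P || Q).
    print(cross_entropy_nats(p, q) - entropy_nats(p))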
3.14 Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very large number of random variables. Often, these probability distributions involve direct interactions between relatively few variables. Using a single function to describe the entire joint probability distribution can be very inefficient (both computationally and statistically).

Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: a, b and c. Suppose that a influences the value of b and b influences the value of c, but that a and c are independent given b.
[Figure 3.6 shows two panels of probability density over x: on the left, q∗ = argmin_q D_KL(p∥q); on the right, q∗ = argmin_q D_KL(q∥p). Each panel plots the curves p(x) and q∗(x).]

Figure 3.6: The KL divergence is asymmetric. Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x). We have the choice of minimizing either D_KL(p∥q) or D_KL(q∥p). We illustrate the effect of this choice using a mixture of two Gaussians for p, and a single Gaussian for q. The choice of which direction of the KL divergence to use is problem-dependent. Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability, while other applications require an approximation that rarely places high probability anywhere that the true distribution places low probability. The choice of the direction of the KL divergence reflects which of these considerations takes priority for each application. (Left) The effect of minimizing D_KL(p∥q). In this case, we select a q that has high probability where p has high probability. When p has multiple modes, q chooses to blur the modes together, in order to put high probability mass on all of them. (Right) The effect of minimizing D_KL(q∥p). In this case, we select a q that has low probability where p has low probability. When p has multiple modes that are sufficiently widely separated, as in this figure, the KL divergence is minimized by choosing a single mode, in order to avoid putting probability mass in the low-probability areas between modes of p. Here, we illustrate the outcome when q is chosen to emphasize the left mode. We could also have achieved an equal value of the KL divergence by choosing the right mode. If the modes are not separated by a sufficiently strong low probability region, then this direction of the KL divergence can still choose to blur the modes.

We can represent the probability distribution over all three variables as a product of probability distributions over two variables:

    p(a, b, c) = p(a) p(b | a) p(c | b).    (3.52)

These factorizations can greatly reduce the number of parameters needed to describe the distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor. This means that we can greatly reduce the cost of representing a distribution if we are able to find a factorization into distributions over fewer variables.

We can describe these kinds of factorizations using graphs. Here we use the word "graph" in the sense of graph theory: a set of vertices that may be connected to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model or graphical model.
model.
undirected. Both kinds of graphical mo dels use a graph G in which each no
models node
de
There are
in the graph corresp tw o main
corresponds kinds of structured probabilistic
onds to a random variable, and an edge connecting tw mo dels: directed and
twoo
undirected. Both kinds of graphical mo dels use a
random variables means that the probability distribution is able to represen graph in which each no de
representt direct
in
in the graph
interactions
teractions bet corresp
etw onds to a random
ween those two random variables. v ariable, and an Gedge connecting two
random variables means that the probability distribution is able to represent direct
Dir
Direecte
cted d mo
models
dels use graphs with directed edges, and they represen representt factoriza-
interactions between those two random variables.
tions into conditional probability distributions, as in the example ab abovov
ove.
e. Sp
Specifically
ecifically
ecifically,,
Dir ecte
a directed mo d mo
model dels
del con use
contains graphs with directed edges,
tains one factor for every random variable and they
ariablex represen t factoriza-
xi in the distribution,
tions into conditional probability distributions,
and that factor consists of the conditional distribution ov as in the example
over
er x i givab
givenenovthe
e. Sp ecifically
paren ts of ,
parents
xaidirected
, denotedmo Pdel
a G (con
xi ):tains one factorY for every random variable x in the distribution,
and that factor consists of the conditional distribution over x given the parents of
p(x) = p (xi | P aG (xi )) . (3.53)
x , denoted P a (x ):
i
p (x ) = p (x P a (x )) . (3.53)
See Fig. 3.7 for an example of a directed graph and the factorization of probability
distributions it represen
represents. ts. |
See Fig. 3.7 for an example of a directed graph and the factorization of probability
Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind. Any set of nodes that are all connected to each other in G is called a clique. Each clique C^(i) in an undirected model is associated with a factor φ^(i)(C^(i)). These factors are just functions, not probability distributions. The output of each factor must be non-negative, but there is no constraint that the factor must sum or integrate to 1 like a probability distribution.
[Figure 3.7 shows a directed graph over the nodes a, b, c, d and e.]

Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

    p(a, b, c, d, e) = p(a) p(b | a) p(c | a, b) p(d | b) p(e | c).    (3.54)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
The probability of a configuration of random variables is proportional to the product of all of these factors—assignments that result in larger factor values are more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution:

    p(x) = (1/Z) ∏_i φ^(i)(C^(i)).    (3.55)

See Fig. 3.8 for an example of an undirected graph and the factorization of probability distributions it represents.
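For small discrete models, Z can be computed by brute-force enumeration; a sketch of ours for a two-clique model φ^(1)(a, b) φ^(2)(b, c) over binary variables, with arbitrary non-negative factor values:

    import numpy as np
    from itertools import product

    phi1 = np.array([[1.0, 0.5], [2.0, 1.0]])   # phi1(a, b): any non-negative values
    phi2 = np.array([[3.0, 1.0], [1.0, 4.0]])   # phi2(b, c)

    # Z = sum over all states of the product of factors (Eq. 3.55).
    Z = sum(phi1[a, b] * phi2[b, c] for a, b, c in product((0, 1), repeat=3))

    def p(a, b, c):
        return phi1[a, b] * phi2[b, c] / Z

    print(sum(p(a, b, c) for a, b, c in product((0, 1), repeat=3)))  # 1.0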
Keep in mind that these graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a probability distribution; it is a property of a particular description of a probability distribution, but any probability distribution may be described in both ways.

Throughout Part I and Part II of this book, we will use structured probabilistic models merely as a language to describe which direct probabilistic relationships different machine learning algorithms choose to represent. No further understanding of structured probabilistic models is needed until the discussion of research topics, in Part III, where we will explore structured probabilistic models in much greater detail.

[Figure 3.8 shows an undirected graph over the nodes a, b, c, d and e.]
Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

    p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e).    (3.56)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
This chapter has reviewed the basic concepts of probability theory that are most relevant to deep learning. One more set of fundamental mathematical tools remains: numerical methods.
Chapter 4

Numerical Computation

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula providing a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.

4.1 Overflow and Underflow

The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero (some software environments will raise exceptions when this occurs, others will return a result with a placeholder not-a-number value) or taking the logarithm of zero (this is usually treated as −∞, which then becomes not-a-number if it is used for many further arithmetic operations).

Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic will usually change these infinite values into not-a-number values.

One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be

    softmax(x)_i = exp(x_i) / ∑_{j=1}^n exp(x_j).    (4.1)

Consider what happens when all of the x_i are equal to some constant c. Analytically, we can see that all of the outputs should be equal to 1/n. Numerically, this may not occur when c has large magnitude. If c is very negative, then exp(c) will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When c is very large and positive, exp(c) will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating softmax(z) where z = x − max_i x_i. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting max_i x_i results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.
There is still one small problem. Underflow in the numerator can still cause the expression as a whole to evaluate to zero. This means that if we implement log softmax(x) by first running the softmax subroutine then passing the result to the log function, we could erroneously obtain −∞. Instead, we must implement a separate function that calculates log softmax in a numerically stable way. The log softmax function can be stabilized using the same trick as we used to stabilize the softmax function.
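In code, the trick is a one-line shift; a minimal numpy sketch of ours, following the z = x − max_i x_i argument above:

    import numpy as np

    def softmax(x):
        """Numerically stable softmax: shift by the max before exponentiating."""
        z = x - np.max(x)                   # largest argument to exp is now 0
        e = np.exp(z)
        return e / e.sum()

    def log_softmax(x):
        """Stable log softmax: never takes the log of an underflowed softmax."""
        z = x - np.max(x)
        return z - np.log(np.sum(np.exp(z)))

    x = np.array([-1e4, -1e4 + 1.0, -1e4 + 2.0])  # naive exp(x) underflows to 0
    print(softmax(x))                   # well defined and sums to 1
    print(log_softmax(x))               # finite values, no spurious -inf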
For the most part, we do not explicitly detail all of the numerical considerations involved in implementing the various algorithms described in this book. Developers of low-level libraries should keep numerical issues in mind when implementing deep learning algorithms. Most readers of this book can simply rely on low-level libraries that provide stable implementations. In some cases, it is possible to implement a new algorithm and have the new implementation automatically stabilized. Theano (Bergstra et al., 2010; Bastien et al., 2012) is an example of a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.
4.2 Poor Conditioning

Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.

Consider the function f(x) = A⁻¹x. When A ∈ ℝ^{n×n} has an eigenvalue decomposition, its condition number is

    max_{i,j} |λ_i / λ_j|.    (4.2)

This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input. This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
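Eq. (4.2) is easy to verify against a library routine; a sketch of ours that builds a symmetric matrix with eigenvalues chosen by hand:

    import numpy as np

    rng = np.random.default_rng(0)
    # Build a symmetric matrix with known eigenvalues: A = Q diag(lam) Q^T.
    Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    lam = np.array([1e4, 10.0, 1.0, 1e-2])
    A = Q @ np.diag(lam) @ Q.T

    print(np.abs(lam).max() / np.abs(lam).min())  # Eq. (4.2): 1e6
    print(np.linalg.cond(A))                      # matches, up to rounding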
4.3 Gradient-Based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).

The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a superscript ∗. For example, we might say x∗ = arg min f(x).

[Figure 4.1 plots f(x) = ½x² and its derivative f′(x) = x. The global minimum is at x = 0, where f′(x) = 0, so gradient descent halts there. For x < 0 we have f′(x) < 0, so we can decrease f by moving rightward; for x > 0 we have f′(x) > 0, so we can decrease f by moving leftward.]

Figure 4.1: An illustration of how the derivatives of a function can be used to follow the function downhill to a minimum. This technique is called gradient descent.
We assume the reader is already familiar with calculus, but provide a brief review of how calculus concepts relate to optimization here.

Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + ε f′(x).

The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See Fig. 4.1 for an example of this technique.

When f′(x) = 0, the derivative provides no information about which direction to move. Points where f′(x) = 0 are known as critical points or stationary points. A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points, so it is not possible to increase f(x) by making infinitesimal steps.
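A direct transcription of this update rule for Fig. 4.1's example f(x) = ½x², f′(x) = x (our sketch, not code from the book); the iterate stops moving as it approaches the critical point at x = 0:

    # Gradient descent on f(x) = 0.5 * x**2, whose derivative is f'(x) = x.
    def f_prime(x):
        return x

    x, eps = 2.0, 0.1                  # start at x = 2 with a small constant step
    for _ in range(100):
        x = x - eps * f_prime(x)       # step opposite the sign of the derivative
    print(x)                           # close to the minimum at x = 0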
[Figure 4.2 shows three panels: a minimum, a maximum, and a saddle point.]

Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself.
Some critical points are neither maxima nor minima. These are known as saddle points. See Fig. 4.2 for examples of each type of critical point.

A point that obtains the absolute lowest value of f(x) is a global minimum. It is possible for there to be only one global minimum or multiple global minima of the function. It is also possible for there to be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions. All of this makes optimization very difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of f that is very low, but not necessarily minimal in any formal sense. See Fig. 4.3 for an example.

We often minimize functions that have multiple inputs: f : ℝⁿ → ℝ. For the concept of "minimization" to make sense, there must still be only one (scalar) output.
For functions with multiple inputs, we must make use of the concept of partial derivatives. The partial derivative ∂f(x)/∂x_i measures how f changes as only the variable x_i increases at point x. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of f is the vector containing all of the partial derivatives, denoted ∇_x f(x). Element i of the gradient is the partial derivative of f with respect to x_i. In multiple dimensions, critical points are points where every element of the gradient is equal to zero.
[Figure 4.3 plots a function f(x) with several local minima, annotated: "Ideally, we would like to arrive at the global minimum, but this might not be possible"; "This local minimum performs nearly as well as the global one, so it is an acceptable halting point"; and "This local minimum performs poorly, and should be avoided."]

Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, the directional derivative is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that ∂/∂α f(x + αu) = uᵀ∇_x f(x).
f , we would lik in which f decreases the
rule,Toweminimize
can see that f (x + like
αue)to= find
u the direction
f (x ).
fastest. We can do this using the directional deriv derivativ
ativ
ative:
e:
To minimize f , we would like to find∇the direction in which f decreases the
fastest. We can do this using the directional derivative:
min u>∇ x f (x) (4.3)
u,u > u=1

= min min ||u||u f (x)


2 ||∇ xf (x)||2 cos θ
(4.3)
(4.4)
>
u,u u=1 ∇
where θ is the angle betw = min u
een u and the gradient.
etween f (x) Substituting
cos θ in ||u||2 = 1 (4.4)
and
ignoring factors that do not dep depend || || ||∇ ||
end on u, this simplifies to minu cos θ. This is
where θ is the
minimized when angle
u b etw
poin
oints een u and
ts in the opp the
opposite gradient.
osite Substituting
direction in ut. =
as the gradien
gradient. and
In1other
ignoring
w ords, thefactors that
gradient dotsnot
poin
oints depend
directly on uand
uphill, , this
thesimplifies to min ||pcos
negative gradient ||θ. This
oints is
directly
minimized
do
downhill. e can udecrease
wnhill. Wwhen points finbythe opposite
moving in thedirection as ofthe
direction thegradien t. gradient.
negative In other
w ords, the gradient p oin
This is known as the metho ts
methoddirectly
d of steuphill,
steep
ep
epest
est descand
descentthe negative
ent or gr
gradient gradient
adient desc
descentp
ent. oints directly
downhill. We can decrease f by moving in the direction of the negative gradient.
ThisSteep
Steepest
est descent
is known as theprop
proposes
methooses a new
d of steepp oin
oint
est t ent or gradient descent.
desc
0
Steepest descent proposes axnew poin
=x − t∇x f (x) (4.5)

x = x 85 f (x) (4.5)


− ∇
where ε is the learning rate, a positive scalar determining the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate f(x − ε∇_x f(x)) for several values of ε and choose the one that results in the smallest objective function value. This last strategy is called a line search.

Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm, and just jump directly to the critical point by solving the equation ∇_x f(x) = 0 for x.

Although gradient descent is limited to optimization in continuous spaces, the general concept of making small moves (that are approximately the best small move) towards better configurations can be generalized to discrete spaces. Ascending an objective function of discrete parameters is called hill climbing (Russell and Norvig, 2003).
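Putting Eqs. (4.3)–(4.5) together: a sketch of ours of gradient descent with the simple line search described above, on a hypothetical quadratic f(x) = ½xᵀAx whose gradient is Ax:

    import numpy as np

    A = np.array([[3.0, 0.5], [0.5, 1.0]])   # a symmetric, positive definite matrix
    f = lambda x: 0.5 * x @ A @ x
    grad = lambda x: A @ x                   # gradient of 0.5 x^T A x for symmetric A

    x = np.array([2.0, -1.5])
    for _ in range(50):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:         # converged: gradient is (nearly) zero
            break
        # Line search: evaluate f for several step sizes, keep the best (Eq. 4.5).
        x = min((x - eps * g for eps in (0.01, 0.1, 0.3, 1.0)), key=f)
    print(x, f(x))                           # near the minimum at x = (0, 0)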
4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices

Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function f : ℝᵐ → ℝⁿ, then the Jacobian matrix J ∈ ℝ^{n×m} of f is defined such that J_{i,j} = ∂f(x)_i/∂x_j.

We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function f : ℝⁿ → ℝ, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted as ∂²f/∂x_i∂x_j. In a single dimension, we can denote d²f/dx² by f″(x). The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second derivative as measuring curvature. Suppose we have a quadratic function (many functions that arise in practice are not quadratic but can be approximated well as quadratic, at least locally). If such a function has a second derivative of zero, then there is no curvature. It is a perfectly flat line, and its value can be predicted using only the gradient. If the gradient is 1, then we can make a step of size ε along the negative gradient, and the cost function will decrease by ε. If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than ε. Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than ε. See Fig.
[Figure: three quadratic functions f(x) plotted against x, with negative curvature, no curvature, and positive curvature respectively.]
Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone as we make a gradient step downhill. In the case of negative curvature, the cost function actually decreases faster than the gradient predicts. In the case of no curvature, the gradient predicts the decrease correctly. In the case of positive curvature, the function decreases slower than expected and eventually begins to increase, so too large of step sizes can actually increase the function inadvertently.
When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix H(f)(x) is defined such that

    H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x).    (4.6)
Equivalently, the Hessian is the Jacobian of the gradient.

Anywhere that the second partial derivatives are continuous, the differential operators are commutative, i.e. their order can be swapped:

    \frac{\partial^2}{\partial x_i \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \partial x_i} f(x).    (4.7)

This implies that H_{i,j} = H_{j,i}, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
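As a quick numerical check of Eq. 4.7 (a sketch under stated assumptions: the test function is an arbitrary smooth choice, and each partial derivative is approximated by a central finite difference):

    import numpy as np

    def partial(f, i, eps=1e-5):
        """Return a function approximating the partial derivative of f with respect to x_i."""
        def df(x):
            e = np.zeros_like(x)
            e[i] = eps
            return (f(x + e) - f(x - e)) / (2 * eps)
        return df

    f = lambda x: x[0] ** 2 * x[1] + np.sin(x[1])  # smooth, so mixed partials commute
    x = np.array([1.0, 2.0])
    d2f_01 = partial(partial(f, 1), 0)(x)  # d/dx_0 of d/dx_1 f
    d2f_10 = partial(partial(f, 0), 1)(x)  # d/dx_1 of d/dx_0 f
    print(d2f_01, d2f_10)  # both approximately 2 * x_0 = 2.0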
The second derivative in a specific direction represented by a unit vector d is given by d^\top H d. When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue. For other directions of d, the directional second derivative is a weighted average of all of the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle with d receiving more weight. The maximum eigenvalue determines the maximum second derivative and the minimum eigenvalue determines the minimum second derivative.
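The following sketch illustrates these claims numerically (the particular symmetric matrix H is an arbitrary assumption for illustration):

    import numpy as np

    H = np.array([[3.0, 1.0],
                  [1.0, 2.0]])  # an arbitrary symmetric Hessian

    # Real eigenvalues and an orthogonal basis of eigenvectors (H is symmetric).
    eigvals, eigvecs = np.linalg.eigh(H)

    # When d is an eigenvector, the directional second derivative is its eigenvalue.
    d = eigvecs[:, 0]
    print(d @ H @ d, eigvals[0])  # these two numbers agree

    # For any other unit vector, d^T H d lies between the extreme eigenvalues.
    theta = 0.3
    d = np.array([np.cos(theta), np.sin(theta)])
    assert eigvals.min() <= d @ H @ d <= eigvals.max()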
The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function f(x) around the current point x^{(0)}:

    f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2} (x - x^{(0)})^\top H (x - x^{(0)}),    (4.8)

where g is the gradient and H is the Hessian at x^{(0)}. If we use a learning rate of \epsilon, then the new point x will be given by x^{(0)} - \epsilon g. Substituting this into our approximation, we obtain

    f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2} \epsilon^2 g^\top H g.    (4.9)

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When g^\top H g is zero or negative, the Taylor series approximation predicts that increasing \epsilon forever will decrease f forever. In practice, the Taylor series is unlikely to remain accurate for large \epsilon, so one must resort to more heuristic choices of \epsilon in this case. When g^\top H g is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

    \epsilon^* = \frac{g^\top g}{g^\top H g}.    (4.10)

In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue \lambda_{max}, then this optimal step size is given by \frac{1}{\lambda_{max}}. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.
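For a purely quadratic model we can verify Eq. 4.10 numerically. In this sketch, the Hessian H and gradient g are arbitrary illustrative choices:

    import numpy as np

    H = np.array([[4.0, 0.0],
                  [0.0, 1.0]])  # illustrative positive definite Hessian
    g = np.array([1.0, 2.0])    # illustrative gradient at the current point

    # Predicted change in f from Eq. 4.9 (negative values mean the function decreases).
    predicted_change = lambda eps: -eps * g @ g + 0.5 * eps ** 2 * g @ H @ g

    eps_star = (g @ g) / (g @ H @ g)  # Eq. 4.10
    eps_grid = np.linspace(1e-4, 2.0, 20000)
    eps_best = eps_grid[np.argmin([predicted_change(e) for e in eps_grid])]
    print(eps_star, eps_best)  # the two values agree closely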
The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. Recall that on a critical point, f'(x) = 0. When f''(x) > 0, this means that f'(x) increases as we move to the right, and f'(x) decreases as we move to the left. This means f'(x - \epsilon) < 0 and
f'(x + \epsilon) > 0 for small enough \epsilon. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when f'(x) = 0 and f''(x) > 0, we can conclude that x is a local minimum. Similarly, when f'(x) = 0 and f''(x) < 0, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when f''(x) = 0, the test is inconclusive. In this case x may be a saddle point, or a part of a flat region.
In multiple dimensions, we need to examine all of the second derivatives of the function. Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where \nabla_x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that x is a local maximum on one cross section of f but a local minimum on another cross section. See Fig. 4.5 for an example. Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is inconclusive whenever all of the non-zero eigenvalues have the same sign, but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.
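Stated as code, the multidimensional second derivative test amounts to inspecting eigenvalue signs. This is a hedged sketch (the helper name and the tolerance are our own choices, and the Hessian is assumed to be evaluated at a critical point):

    import numpy as np

    def classify_critical_point(H, tol=1e-8):
        """Second derivative test, given the Hessian H at a point where the gradient is zero."""
        eigvals = np.linalg.eigvalsh(H)  # H is assumed symmetric
        if np.all(eigvals > tol):
            return "local minimum"   # positive definite
        if np.all(eigvals < -tol):
            return "local maximum"   # negative definite
        if np.any(eigvals > tol) and np.any(eigvals < -tol):
            return "saddle point"    # eigenvalues of both signs
        return "inconclusive"        # some eigenvalue is (numerically) zero

    # The saddle of f(x) = x_1^2 - x_2^2 from Fig. 4.5 has Hessian diag(2, -2).
    print(classify_critical_point(np.diag([2.0, -2.0])))  # "saddle point"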
In multiple dimensions, there can be a wide variety of different second derivatives at a single point, because there is a different second derivative for each direction. The condition number of the Hessian measures how much the second derivatives vary. When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer. It also makes it difficult to choose a good step size. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature. See Fig. 4.6 for an example.
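The condition number itself is simple to compute. The sketch below uses a Hessian chosen, as an assumption, to reproduce the condition number of 5 described in Fig. 4.6:

    import numpy as np

    # Hessian with curvature 5 along [1, 1] and curvature 1 along [1, -1].
    H = np.array([[3.0, 2.0],
                  [2.0, 3.0]])
    eigvals = np.linalg.eigvalsh(H)
    print(eigvals.max() / eigvals.min())  # condition number: 5.0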
[Figure: surface plot of f(x_1, x_2) over x_1, x_2 \in [-15, 15], with f ranging from -500 to 500.]
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x_1^2 - x_2^2. Along the axis corresponding to x_1, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x_2, the function curves downward. This direction is an eigenvector of the Hessian with negative eigenvalue. The name “saddle point” derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. In more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. We can think of a saddle point with both signs of eigenvalues as being a local maximum within one cross section and a local minimum within another cross section.
[Figure: the zigzag path of gradient descent plotted in the (x_1, x_2) plane.]
Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) whose Hessian matrix has condition number 5. This means that the direction of most curvature has five times more curvature than the direction of least curvature. In this case, the most curvature is in the direction [1, 1]^\top and the least curvature is in the direction [1, -1]^\top. The red lines indicate the path followed by gradient descent. This very elongated quadratic function resembles a long canyon. Gradient descent wastes time repeatedly descending canyon walls, because they are the steepest feature. Because the step size is somewhat too large, it has a tendency to overshoot the bottom of the function and thus needs to descend the opposite canyon wall on the next iteration. The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed in this direction indicates that this directional derivative is rapidly increasing, so an optimization algorithm based on the Hessian could predict that the steepest direction is not actually a promising search direction in this context.
This issue can be resolved by using information from the Hessian matrix to guide the search. The simplest method for doing so is known as Newton's method.

Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^{(0)}:

    f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top \nabla_x f(x^{(0)}) + \frac{1}{2} (x - x^{(0)})^\top H(f)(x^{(0)}) (x - x^{(0)}).    (4.11)

If we then solve for the critical point of this function, we obtain:

    x^* = x^{(0)} - H(f)(x^{(0)})^{-1} \nabla_x f(x^{(0)}).    (4.12)

When f is a positive definite quadratic function, Newton's method consists of applying Eq. 4.12 once to jump to the minimum of the function directly. When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying Eq. 4.12 multiple times. Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point. As discussed in Sec. 8.2.3, Newton's method is only appropriate when the nearby critical point is a minimum (all the eigenvalues of the Hessian are positive), whereas gradient descent is not attracted to saddle points unless the gradient points toward them.
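A minimal sketch of the Newton update of Eq. 4.12 (the quadratic test problem is an assumption; we solve a linear system with np.linalg.solve rather than forming the matrix inverse, which is the standard numerically preferable choice):

    import numpy as np

    def newtons_method(grad_f, hessian_f, x0, n_steps=10):
        """Repeatedly apply the Newton update of Eq. 4.12."""
        x = x0
        for _ in range(n_steps):
            # x <- x - H(f)(x)^{-1} grad f(x), computed via a linear solve.
            x = x - np.linalg.solve(hessian_f(x), grad_f(x))
        return x

    # Illustration: f(x) = 0.5 x^T H x - b^T x with arbitrary H and b.
    H = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    grad_f = lambda x: H @ x - b
    hessian_f = lambda x: H
    # Because f is a positive definite quadratic, one step reaches the minimum.
    print(newtons_method(grad_f, hessian_f, np.zeros(2), n_steps=1))
    print(np.linalg.solve(H, b))  # the exact minimizer, for comparison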
Optimization algorithms such as gradient descent that use only the gradient are called first-order optimization algorithms. Optimization algorithms such as Newton's method that also use the Hessian matrix are called second-order optimization algorithms (Nocedal and Wright, 2006).

The optimization algorithms employed in most contexts in this book are applicable to a wide variety of functions, but come with almost no guarantees. This is because the family of functions used in deep learning is quite complicated. In many other fields, the dominant approach to optimization is to design optimization algorithms for a limited family of functions.
In the context of deep learning, we sometimes gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives. A Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant L:

    \forall x, \forall y, \; |f(x) - f(y)| \leq L \| x - y \|_2.    (4.13)

This property is useful because it allows us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output. Lipschitz continuity is also a fairly weak constraint,
and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.

Perhaps the most successful field of specialized optimization is convex optimization. Convex optimization algorithms are able to provide many more guarantees by making stronger restrictions. Convex optimization algorithms are applicable only to convex functions, that is, functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima. However, most problems in deep learning are difficult to express in terms of convex optimization. Convex optimization is used only as a subroutine of some deep learning algorithms. Ideas from the analysis of convex optimization algorithms can be useful for proving the convergence of deep learning algorithms. However, in general, the importance of convex optimization is greatly diminished in the context of deep learning. For more information about convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).
4.4 Constrained Optimization

Sometimes we wish not only to maximize or minimize a function f(x) over all possible values of x. Instead we may wish to find the maximal or minimal value of f(x) for values of x in some set \mathbb{S}. This is known as constrained optimization. Points x that lie within the set \mathbb{S} are called feasible points in constrained optimization terminology.

We often wish to find a solution that is small in some sense. A common
approach in such situations is to impose a norm constraint, such as \| x \| \leq 1.

One simple approach to constrained optimization is simply to modify gradient descent taking the constraint into account. If we use a small constant step size \epsilon, we can make gradient descent steps, then project the result back into \mathbb{S}. If we use a line search, we can search only over step sizes \epsilon that yield new x points that are feasible, or we can project each point on the line back into the constraint region. When possible, this method can be made more efficient by projecting the gradient into the tangent space of the feasible region before taking the step or beginning the line search (Rosen, 1960).

A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem. For example, if we want to minimize f(x) for x \in \mathbb{R}^2 with x constrained to have exactly unit L^2 norm, we can instead minimize g(\theta) = f([\cos \theta, \sin \theta]^\top) with respect to \theta, then return [\cos \theta, \sin \theta] as the solution to the original problem. This approach requires creativity; the transformation between optimization problems must be designed specifically for each case we encounter.
The Karush–Kuhn–Tucker (KKT) approach¹ provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe \mathbb{S} in terms of equations and inequalities. We want a description of \mathbb{S} in terms of m functions g^{(i)} and n functions h^{(j)} so that \mathbb{S} = \{ x \mid \forall i, g^{(i)}(x) = 0 \text{ and } \forall j, h^{(j)}(x) \leq 0 \}. The equations involving g are called the equality constraints, and the inequalities involving h are called inequality constraints.

We introduce new variables \lambda_i and \alpha_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as

    L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x).    (4.14)

We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one feasible point exists and f(x) is not permitted to have value \infty, then

    \min_x \max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha)    (4.15)

has the same optimal objective function value and set of optimal points x as

    \min_{x \in \mathbb{S}} f(x).    (4.16)

This follows because any time the constraints are satisfied,

    \max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha) = f(x),    (4.17)

while any time a constraint is violated,

    \max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha) = \infty.    (4.18)

These properties guarantee that no infeasible point will ever be optimal, and that the optimum within the feasible points is unchanged.

¹ The KKT approach generalizes the method of Lagrange multipliers, which allows equality constraints but not inequality constraints.
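To build some intuition for Eqs. 4.15-4.18, consider a toy problem of our own choosing: minimize f(x) = x^2 subject to h(x) = 0.5 - x \leq 0. Maximizing the Lagrangian over \alpha \geq 0 recovers f(x) at feasible points and blows up at infeasible ones (the finite grid over \alpha stands in for the unbounded maximization):

    import numpy as np

    f = lambda x: x ** 2
    h = lambda x: 0.5 - x  # inequality constraint h(x) <= 0, i.e. x >= 0.5

    def inner_max(x, alphas=np.linspace(0.0, 100.0, 1001)):
        """Approximate max over alpha >= 0 of L(x, alpha) = f(x) + alpha * h(x)."""
        return max(f(x) + a * h(x) for a in alphas)

    print(inner_max(1.0))  # feasible: h < 0, best alpha is 0, value is f(1.0) = 1.0
    print(inner_max(0.0))  # infeasible: h > 0, value grows without bound as alpha increases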
To perform constrained maximization, we can construct the generalized Lagrange function of -f(x), which leads to this optimization problem:

    \min_x \max_\lambda \max_{\alpha, \alpha \geq 0} -f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x).    (4.19)

We may also convert this to a problem with maximization in the outer loop:

    \max_x \min_\lambda \min_{\alpha, \alpha \geq 0} f(x) + \sum_i \lambda_i g^{(i)}(x) - \sum_j \alpha_j h^{(j)}(x).    (4.20)

The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each \lambda_i.

The inequality constraints are particularly interesting. We say that a constraint h^{(i)}(x) is active if h^{(i)}(x^*) = 0. If a constraint is not active, then the solution to the problem found using that constraint would remain at least a local solution if that constraint were removed. It is possible that an inactive constraint excludes other solutions. For example, a convex problem with an entire region of globally optimal points (a wide, flat region of equal cost) could have a subset of this region eliminated by constraints, or a non-convex problem could have better local stationary points excluded by a constraint that is inactive at convergence. However, the point found at convergence remains a stationary point whether or not the inactive constraints are included. Because an inactive h^{(i)} has negative value, the solution to \min_x \max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha) will have \alpha_i = 0. We can thus observe that at the solution, \alpha \odot h(x) = 0. In other words, for all i, we know that at least one of the constraints \alpha_i \geq 0 and h^{(i)}(x) \leq 0 must be active at the solution. To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution to x, or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.

The properties that the gradient of the generalized Lagrangian is zero, all constraints on both x and the KKT multipliers are satisfied, and \alpha \odot h(x) = 0 are called the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). Together, these properties describe the optimal points of constrained optimization problems.

For more information about the KKT approach, see Nocedal and Wright (2006).
4.5 Example: Linear Least Squares

Suppose we want to find the value of x that minimizes

    f(x) = \frac{1}{2} \| A x - b \|_2^2.    (4.21)

There are specialized linear algebra algorithms that can solve this problem efficiently. However, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work.

First, we need to obtain the gradient:

    \nabla_x f(x) = A^\top (A x - b) = A^\top A x - A^\top b.    (4.22)

We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 for details.

Algorithm 4.1 An algorithm to minimize f(x) = \frac{1}{2} \| A x - b \|_2^2 with respect to x using gradient descent.
    Set the step size (\epsilon) and tolerance (\delta) to small, positive numbers.
    while \| A^\top A x - A^\top b \|_2 > \delta do
        x \leftarrow x - \epsilon (A^\top A x - A^\top b)
    end while

One can also solve this problem using Newton's method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton's method is exact, and the algorithm converges to the global minimum in a single step.

Now suppose we wish to minimize the same function, but subject to the constraint x^\top x \leq 1. To do so, we introduce the Lagrangian

    L(x, \lambda) = f(x) + \lambda (x^\top x - 1).    (4.23)

We can now solve the problem

    \min_x \max_{\lambda, \lambda \geq 0} L(x, \lambda).    (4.24)

The smallest-norm solution to the unconstrained least squares problem may be found using the Moore–Penrose pseudoinverse: x = A^+ b. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a solution where the constraint is active. By differentiating the Lagrangian with respect to x, we obtain the equation

    A^\top A x - A^\top b + 2 \lambda x = 0.    (4.25)

This tells us that the solution will take the form

    x = (A^\top A + 2 \lambda I)^{-1} A^\top b.    (4.26)

The magnitude of \lambda must be chosen such that the result obeys the constraint. We can find this value by performing gradient ascent on \lambda. To do so, observe

    \frac{\partial}{\partial \lambda} L(x, \lambda) = x^\top x - 1.    (4.27)

When the norm of x exceeds 1, this derivative is positive, so to follow the derivative uphill and increase the Lagrangian with respect to \lambda, we increase \lambda. Because the coefficient on the x^\top x penalty has increased, solving the linear equation for x will now yield a solution with smaller norm. The process of solving the linear equation and adjusting \lambda continues until x has the correct norm and the derivative on \lambda is 0.
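The entire procedure fits in a short script. This is a sketch under stated assumptions (the data A and b, the ascent rate on \lambda, and the iteration count are illustrative, not prescribed by the text):

    import numpy as np

    def constrained_least_squares(A, b, lam_step=0.1, n_iters=1000):
        """Minimize 0.5 * ||Ax - b||^2 subject to x^T x <= 1, following Eqs. 4.25-4.27:
        alternately solve the linear system for x and take ascent steps on lambda."""
        lam = 0.0
        n = A.shape[1]
        for _ in range(n_iters):
            # Eq. 4.26: x = (A^T A + 2 * lambda * I)^{-1} A^T b
            x = np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ b)
            # Eq. 4.27: dL/dlambda = x^T x - 1; ascend, keeping lambda >= 0.
            lam = max(0.0, lam + lam_step * (x @ x - 1.0))
        return x

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 3))
    b = 10.0 * rng.normal(size=5)  # large b, so the unconstrained solution is likely infeasible
    x = constrained_least_squares(A, b)
    print(np.linalg.norm(x))  # approximately 1: the constraint is active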
This concludes the mathematical preliminaries that we use to develop machine learning algorithms. We are now ready to build and analyze some full-fledged learning systems.

Chapter 5

Machine Learning Basics

Deep learning is a specific kind of machine learning. In order to understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that will be applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to Sec. 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.

We begin with a definition of what a learning algorithm is, and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters that must be determined external to the learning algorithm itself; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components such as an optimization algorithm, a cost function, a model, and a dataset to build a machine learning algorithm. Finally, in Sec. 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.
5.1 Learning Algorithms

A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides the definition “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” One can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we do not make any attempt in this book to provide a formal definition of what may be used for each of these entities. Instead, the following sections provide intuitive descriptions and examples of the different kinds of tasks, performance measures and experiences that can be used to construct machine learning algorithms.
5.1.1 The Task, T

Machine learning allows us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of machine learning entails developing our understanding of the principles that underlie intelligence.
In this relatively formal definition of the word “task,” the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.
Mac
Machine
hine learning tasks are usually describ described ed in terms of ho howw the mac machine
hine
a program that specifies how to walk manually.
learning system should pro process
cess an example . An example is a collection of fe
featur
atur
atures
es
thatMac ha
havvhine
e beenlearning tasksely
quantitativ
quantitatively aremeasured
usually describ
from someed inobterms
ject orofev
object ho
even
enw
ent the mac
t that we w hine
an
antt
learning
the mac system
machine
hine shouldsystem
learning processtoan example
pro
process.
cess. W . eAn examplerepresen
typically is a collection
represent of featur
t an example asesa
that ha v e b een
n
vector x ∈ R where eacquantitativ
each ely
h en try xi of the vector is another feature. For example,t
entry measured from some ob ject or ev en t that w e w an
the features
the machine of learning system to process. We typically represent an example as a
R an image are usually the values of the pixels in the image.
vector x where each entry x of the vector is another feature. For example,
the features∈ of an image are usually the values of the pixels in the image.
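As a minimal illustration of this representation (our sketch, not part of the original text; it assumes NumPy and an invented 2×2 grayscale image), an image can be flattened into a feature vector $\boldsymbol{x} \in \mathbb{R}^n$ whose entries are pixel values:

    import numpy as np

    # A hypothetical 2x2 grayscale image: each pixel intensity is one feature.
    image = np.array([[0.0, 0.5],
                      [0.9, 1.0]])

    # Flatten the image into a feature vector x in R^n, here n = 4.
    x = image.reshape(-1)
    print(x.shape)  # (4,)
    print(x)        # [0.  0.5 0.9 1. ]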
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:

• Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function $f : \mathbb{R}^n \to \{1, \dots, k\}$. When $y = f(\boldsymbol{x})$, the model assigns an input described by vector $\boldsymbol{x}$ to a category identified by numeric code $y$. There are other variants of the classification task, for example, where $f$ outputs a probability distribution over classes. (A minimal sketch of such classifier and regressor functions appears after this list.) An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that allows computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and allow computers to interact more naturally with their users.
• Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying $\boldsymbol{x}$ with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables. With $n$ input variables, we can now obtain all $2^n$ different classification functions needed for each possible set of missing inputs, but we only need to learn a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
• Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function $f : \mathbb{R}^n \to \mathbb{R}$. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.

• Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies including Microsoft, IBM and Google (Hinton et al., 2012b).

• Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages, such as to translate from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).

• Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category, and subsumes the transcription and translation tasks described above, but also many other tasks. One example is parsing—mapping a natural language sentence into a tree that describes its grammatical structure and tagging nodes of the trees as being verbs, nouns, or adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category. For example, deep learning can be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output need not have its form mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are called structured output tasks because the program must output several values that are all tightly inter-related. For example, the words produced by an image captioning program must form a valid sentence.

• Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection. By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief’s purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.

• Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications where it can be expensive or boring for an artist to generate large volumes of content by hand. For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013). In some cases, we want the sampling or synthesis procedure to generate some specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.

• Imputation of missing values: In this type of task, the machine learning algorithm is given a new example $\boldsymbol{x} \in \mathbb{R}^n$, but with some entries $x_i$ of $\boldsymbol{x}$ missing. The algorithm must provide a prediction of the values of the missing entries.

• Denoising: In this type of task, the machine learning algorithm is given a corrupted example $\tilde{\boldsymbol{x}} \in \mathbb{R}^n$ obtained by an unknown corruption process from a clean example $\boldsymbol{x} \in \mathbb{R}^n$. The learner must predict the clean example $\boldsymbol{x}$ from its corrupted version $\tilde{\boldsymbol{x}}$, or more generally predict the conditional probability distribution $p(\boldsymbol{x} \mid \tilde{\boldsymbol{x}})$.

• Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function $p_{\text{model}} : \mathbb{R}^n \to \mathbb{R}$, where $p_{\text{model}}(\boldsymbol{x})$ can be interpreted as a probability density function (if $\boldsymbol{x}$ is continuous) or a probability mass function (if $\boldsymbol{x}$ is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require that the learning algorithm has at least implicitly captured the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution $p(\boldsymbol{x})$, we can use that distribution to solve the missing value imputation task. If a value $x_i$ is missing and all of the other values, denoted $\boldsymbol{x}_{-i}$, are given, then we know the distribution over it is given by $p(x_i \mid \boldsymbol{x}_{-i})$. In practice, density estimation does not always allow us to solve all of these related tasks, because in many cases the required operations on $p(\boldsymbol{x})$ are computationally intractable.

Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.
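To make the function view of these tasks concrete, here is a minimal sketch (our illustration, not part of the original text) of a classifier $f : \mathbb{R}^n \to \{1, \dots, k\}$ and a regressor $f : \mathbb{R}^n \to \mathbb{R}$, using made-up linear score functions with hypothetical random weights:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 4, 3                      # input dimension and number of categories
    W = rng.normal(size=(k, n))      # hypothetical classifier parameters
    w = rng.normal(size=n)           # hypothetical regressor parameters

    def classify(x):
        """A classifier f : R^n -> {1, ..., k}: pick the category with the
        highest linear score (1-indexed to match the text)."""
        scores = W @ x
        return int(np.argmax(scores)) + 1

    def predict_probabilities(x):
        """The probabilistic variant: f outputs a distribution over the k classes."""
        scores = W @ x
        e = np.exp(scores - scores.max())   # numerically stabilized softmax
        return e / e.sum()

    def regress(x):
        """A regressor f : R^n -> R: output a single numerical value."""
        return float(w @ x)

    x = rng.normal(size=n)           # one example, a vector in R^n
    print(classify(x))
    print(predict_probabilities(x))
    print(regress(x))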
5.1.2 The Performance Measure, P

In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples. (A small sketch of the accuracy and error rate computations appears at the end of this section.)

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.

The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.

In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.

In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.
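As a concrete illustration of accuracy and error rate (a sketch of ours, assuming NumPy and made-up prediction arrays, not an excerpt from the text):

    import numpy as np

    # Hypothetical predicted and true category codes on a test set of 8 examples.
    y_pred = np.array([1, 2, 2, 0, 1, 1, 2, 0])
    y_true = np.array([1, 2, 1, 0, 1, 2, 2, 0])

    # Accuracy: proportion of examples with the correct output.
    accuracy = np.mean(y_pred == y_true)

    # Error rate (expected 0-1 loss): proportion of incorrect outputs.
    error_rate = np.mean(y_pred != y_true)

    print(accuracy, error_rate)  # 0.75 0.25; the two always sum to 1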
5.1.3 The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.

Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as defined in Sec. 5.1.1. Sometimes we will also call examples data points.

One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each of the parts of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly as in density estimation or implicitly for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.

Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target. For example, the Iris dataset is annotated with the species of each iris plant. A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements.

Roughly speaking, unsupervised learning involves observing several examples of a random vector $\boldsymbol{x}$, and attempting to implicitly or explicitly learn the probability distribution $p(\boldsymbol{x})$, or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector $\boldsymbol{x}$ and an associated value or vector $\boldsymbol{y}$, and learning to predict $\boldsymbol{y}$ from $\boldsymbol{x}$, usually by estimating $p(\boldsymbol{y} \mid \boldsymbol{x})$. The term supervised learning originates from the view of the target $\boldsymbol{y}$ being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.

Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector $\boldsymbol{x} \in \mathbb{R}^n$, the joint distribution can be decomposed as

$$p(\boldsymbol{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}). \tag{5.1}$$
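The following small numerical check (our addition, assuming NumPy; not from the original text) verifies Eq. 5.1 for a randomly chosen joint distribution over n = 3 binary variables, recovering each conditional by marginalizing and normalizing the joint table:

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.random((2, 2, 2))
    p /= p.sum()                # a random joint p(x1, x2, x3) over binary variables

    # Marginals needed for the conditionals in Eq. 5.1.
    p1 = p.sum(axis=(1, 2))     # p(x1)
    p12 = p.sum(axis=2)         # p(x1, x2)

    for x1 in (0, 1):
        for x2 in (0, 1):
            for x3 in (0, 1):
                # p(x1) * p(x2 | x1) * p(x3 | x1, x2)
                chain = p1[x1] * (p12[x1, x2] / p1[x1]) * (p[x1, x2, x3] / p12[x1, x2])
                assert np.isclose(chain, p[x1, x2, x3])
    print("Eq. 5.1 holds for every configuration.")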
This decomposition means that we can solve the ostensibly unsupervised problem of modeling $p(\boldsymbol{x})$ by splitting it into $n$ supervised learning problems. Alternatively, we can solve the supervised learning problem of learning $p(y \mid \boldsymbol{x})$ by using traditional unsupervised learning technologies to learn the joint distribution $p(\boldsymbol{x}, y)$ and inferring

$$p(y \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x}, y)}{\sum_{y'} p(\boldsymbol{x}, y')}. \tag{5.2}$$

Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help to roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.

Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled. For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.

Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.

One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix $\boldsymbol{X} \in \mathbb{R}^{150 \times 4}$, where $X_{i,1}$ is the sepal length of plant $i$, $X_{i,2}$ is the sepal width of plant $i$, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets. (A sketch that builds this design matrix appears at the end of this section.)

Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all of the photographs may be described with the same length of vector. Sec. 9.7 and Chapter 10 describe how to handle different types of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with $m$ rows, we will describe it as a set containing $m$ elements: $\{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(m)}\}$. This notation does not imply that any two example vectors $\boldsymbol{x}^{(i)}$ and $\boldsymbol{x}^{(j)}$ have the same size.

In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations $\boldsymbol{X}$, we also provide a vector of labels $\boldsymbol{y}$, with $y_i$ providing the label for example $i$.

Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.

Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.
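As a small illustration of the design matrix and label vector for the Iris dataset (our sketch; it assumes scikit-learn is installed, which ships a copy of Fisher's data):

    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    X = iris.data      # design matrix, shape (150, 4): one example per row
    y = iris.target    # label vector, shape (150,): species code 0, 1 or 2 per plant

    print(X.shape, y.shape)   # (150, 4) (150,)
    print(X[0])               # sepal length, sepal width, petal length, petal width
    print(np.unique(y))       # the three species codes: [0 1 2]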
5.1.4 Example: Linear Regression

Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program’s performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand its behavior.

As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $\boldsymbol{x} \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. In the case of linear regression, the output is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be

$$\hat{y} = \boldsymbol{w}^\top \boldsymbol{x} \tag{5.3}$$

where $\boldsymbol{w} \in \mathbb{R}^n$ is a vector of parameters.

Parameters are values that control the behavior of the system. In this case, $w_i$ is the coefficient that we multiply by feature $x_i$ before summing up the contributions from all the features. We can think of $\boldsymbol{w}$ as a set of weights that determine how each feature affects the prediction. If a feature $x_i$ receives a positive weight $w_i$, then increasing the value of that feature increases the value of our prediction $\hat{y}$. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature’s weight is large in magnitude, then it has a large effect on the prediction. If a feature’s weight is zero, it has no effect on the prediction.

We thus have a definition of our task $T$: to predict $y$ from $\boldsymbol{x}$ by outputting $\hat{y} = \boldsymbol{w}^\top \boldsymbol{x}$. Next we need a definition of our performance measure, $P$.

Suppose that we have a design matrix of $m$ example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of $y$ for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as $\boldsymbol{X}^{(\text{test})}$ and the vector of regression targets as $\boldsymbol{y}^{(\text{test})}$.

One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If $\hat{\boldsymbol{y}}^{(\text{test})}$ gives the predictions of the model on the test set, then the mean squared error is given by

$$\text{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left( \hat{\boldsymbol{y}}^{(\text{test})} - \boldsymbol{y}^{(\text{test})} \right)_i^2. \tag{5.4}$$

Intuitively, one can see that this error measure decreases to 0 when $\hat{\boldsymbol{y}}^{(\text{test})} = \boldsymbol{y}^{(\text{test})}$. We can also see that

$$\text{MSE}_{\text{test}} = \frac{1}{m} \left\| \hat{\boldsymbol{y}}^{(\text{test})} - \boldsymbol{y}^{(\text{test})} \right\|_2^2, \tag{5.5}$$

so the error increases whenever the Euclidean distance between the predictions and the targets increases.

To make a machine learning algorithm, we need to design an algorithm that will improve the weights $\boldsymbol{w}$ in a way that reduces $\text{MSE}_{\text{test}}$ when the algorithm is allowed to gain experience by observing a training set $(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})})$. One intuitive way of doing this (which we will justify later, in Sec. 5.5.1) is just to minimize the mean squared error on the training set, $\text{MSE}_{\text{train}}$.

To minimize $\text{MSE}_{\text{train}}$, we can simply solve for where its gradient is 0:

$$\nabla_{\boldsymbol{w}} \text{MSE}_{\text{train}} = 0 \tag{5.6}$$

$$\Rightarrow \nabla_{\boldsymbol{w}} \frac{1}{m} \left\| \hat{\boldsymbol{y}}^{(\text{train})} - \boldsymbol{y}^{(\text{train})} \right\|_2^2 = 0 \tag{5.7}$$

$$\Rightarrow \frac{1}{m} \nabla_{\boldsymbol{w}} \left\| \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - \boldsymbol{y}^{(\text{train})} \right\|_2^2 = 0 \tag{5.8}$$

$$\Rightarrow \nabla_{\boldsymbol{w}} \left( \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - \boldsymbol{y}^{(\text{train})} \right)^\top \left( \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - \boldsymbol{y}^{(\text{train})} \right) = 0 \tag{5.9}$$

$$\Rightarrow \nabla_{\boldsymbol{w}} \left( \boldsymbol{w}^\top \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - 2 \boldsymbol{w}^\top \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})} + \boldsymbol{y}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})} \right) = 0 \tag{5.10}$$

$$\Rightarrow 2 \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - 2 \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})} = 0 \tag{5.11}$$

$$\Rightarrow \boldsymbol{w} = \left( \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \right)^{-1} \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})} \tag{5.12}$$

The system of equations whose solution is given by Eq. 5.12 is known as the normal equations. Evaluating Eq. 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see Fig. 5.1. (A runnable sketch of this algorithm appears at the end of this section.)

[Figure 5.1. Left panel, “Linear regression example”: a plot of $y$ versus $x_1$ showing the fitted model. Right panel, “Optimization of $w$”: a plot of $\text{MSE}_{(\text{train})}$ versus $w_1$.]

It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter—an intercept term $b$. In this model

$$\hat{y} = \boldsymbol{w}^\top \boldsymbol{x} + b \tag{5.13}$$

so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model’s predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\boldsymbol{x}$ with an extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We will frequently use the term “linear” when referring to affine functions throughout this book.

The intercept term $b$ is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being $b$ in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm’s expected estimate of a quantity is not equal to the true quantity.

Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work. In the subsequent sections we will describe some of the basic principles underlying learning algorithm design and demonstrate how these principles can be used to build more complicated learning algorithms.
5.2 Capacity, Overfitting and Underfitting

The central challenge in machine learning is that we must perform well on new, previously unseen inputs, not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error, and we reduce this training error. So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. The generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.

We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.

In our linear regression example, we trained the model by minimizing the training error,

(1/m^(train)) ||X^(train) w − y^(train)||₂²,    (5.14)

but we actually care about the test error, (1/m^(test)) ||X^(test) w − y^(test)||₂².

How can we affect performance on the test set when we get to observe only the training set? The field of statistical learning theory provides some answers. If the
training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress.

The train and test data are generated by a probability distribution over datasets called the data generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the train set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption allows us to describe the data generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data generating distribution, denoted p_data. This probabilistic framework and the i.i.d. assumptions allow us to mathematically study the relationship between training error and test error.

One immediate connection we can observe between the training and test error is that the expected training error of a randomly selected model is equal to the expected test error of that model. Suppose we have a probability distribution p(x, y) and we sample from it repeatedly to generate the train set and the test set. For some fixed value w, the expected training set error is exactly the same as the expected test set error, because both expectations are formed using the same dataset sampling process. The only difference between the two conditions is the name we assign to the dataset we sample.

Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of time, then sample both datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small.

2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.

We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's capacity is its ability to fit a wide variety of
functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.

A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction

ŷ = b + wx.    (5.15)

By introducing x² as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of x:

ŷ = b + w_1 x + w_2 x².    (5.16)

Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:

ŷ = b + ∑_{i=1}^{9} w_i x^i.    (5.17)
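Concretely, the polynomial expansion just amounts to building a new design matrix. The NumPy sketch below is my own illustration (not from the book); the column of ones absorbs the intercept b, and because the model is still linear in its parameters, ordinary least squares still applies:

import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input array to columns [1, x, x^2, ..., x^degree].
    The leading column of ones plays the role of the intercept b."""
    return np.stack([x ** p for p in range(degree + 1)], axis=1)

# Fitting stays linear in the parameters, so least squares applies
# directly to the expanded features (cf. Eq. 5.12).
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2          # true function is quadratic
Phi = polynomial_features(x, 9)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]

A least-squares solver is used here rather than a plain linear solve, since the degree-9 design matrix can be nearly singular when the degree approaches the number of examples.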
Machine learning algorithms will generally perform best when their capacity is appropriate in regard to the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task they may overfit.

Fig. 5.2 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more
parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task so it generalizes well to new data.

(Figure 5.2: comparison of a linear, quadratic and degree-9 predictor fit to data whose true underlying function is quadratic; the axes are x and y.)

So far we have only described changing a model's capacity by changing the number of input features it has (and simultaneously adding new parameters associated with those features). There are in fact many ways of changing a model's capacity. Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a very difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection
of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.

Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287-1347). This principle states that among competing hypotheses that explain known observations equally well, one should choose the “simplest” one. This idea was formalized and made more precise in the 20th century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).

Statistical learning theory provides various means of quantifying model capacity. Among these, the most well-known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.

Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995). These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the very general non-convex optimization problems involved in deep learning.

We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error) we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in Fig. 5.3.

To reach the most extreme case of arbitrarily high capacity, we introduce the concept of non-parametric models. So far, we have seen only parametric
models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Non-parametric models have no such limitation.

Sometimes, non-parametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical non-parametric models by making their complexity a function of the training set size. One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the X and y from the training set. When asked to classify a test point x, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, ŷ = y_i where i = arg min_i ||X_{i,:} − x||₂². The algorithm can also be generalized to distance metrics other than the L² norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed to break ties by averaging the y_i values for all X_{i,:} that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.
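As a concrete sketch of nearest neighbor regression (my own illustration, not code from the book), including the tie-averaging rule just described:

import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return y_i for the training row nearest to x in squared L2
    distance, averaging y over all rows tied for nearest."""
    dists = np.sum((X_train - x) ** 2, axis=1)
    nearest = dists == dists.min()        # boolean mask of ties
    return y_train[nearest].mean()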
Finally, we can also create a non-parametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number
of parameters as needed. For example, we could imagine an outer loop of learning that changes the degree of the polynomial learned by linear regression on top of a polynomial expansion of the input.

The ideal model is an oracle that simply knows the true probability distribution that generates the data. Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from x to y may be inherently stochastic, or y may be a deterministic function that involves other variables besides those included in x. The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error.

Training and generalization error vary as the size of the training set varies. Expected generalization error can never increase as the number of training examples increases. For non-parametric models, more data yields better generalization until the best possible error is achieved. Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error. See Fig. 5.4 for an illustration. Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and generalization error. In this situation, we may be able to reduce this gap by gathering more training examples.

5.2.1 The No Free Lunch Theorem

Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set.

In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.

Unfortunately, even this does not resolve the entire problem. The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.

Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the “real world” that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.

5.2.2 Regularization

The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better.

So far, the only method of modifying a learning algorithm we have discussed is to increase or decrease the model's capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified.

The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions. The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input. These linear functions can be very useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform very well if we tried to use it to predict sin(x) from x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the amount of these functions.

We can also give a learning algorithm a preference for one solution in its hypothesis space over another. This means that both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.

For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize
a sum comprising both the mean squared error on the training set and a criterion J(w) that expresses a preference for the weights to have smaller squared L² norm. Specifically,

J(w) = MSE_train + λ w^⊤ w,    (5.18)

where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we impose no preference, and larger λ forces the weights to become smaller. Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put weight on fewer of the features. As an example of how we can control a model's tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 5.5 for the results.
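Weight decay also preserves the closed-form trainability of linear regression. In the sketch below (my own illustration; the lam argument multiplies the unnormalized squared error, so it differs from the λ of Eq. 5.18 by a constant factor of m), setting the gradient of the regularized criterion to zero yields a modified set of normal equations:

import numpy as np

def fit_weight_decay(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2, an unnormalized form
    of Eq. 5.18. Zeroing the gradient gives the modified normal
    equations (X^T X + lam * I) w = X^T y."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# lam = 0 recovers ordinary linear regression (Eq. 5.12);
# larger lam shrinks the learned weights toward zero.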

(Figure 5.5: a high-degree polynomial regression model fit with different values of the weight decay parameter λ.)
More generally, we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is Ω(w) = w^⊤ w. In Chapter 7, we will see that many other
regularizers are possible.

Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space. We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function.

In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization. Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.

The no free lunch theorem has made it clear that there is no best machine learning algorithm, and, in particular, no best form of regularization. Instead we must choose a form of regularization that is well-suited to the particular task we want to solve. The philosophy of deep learning in general and this book in particular is that a very wide range of tasks (such as all of the intellectual tasks that people can do) may all be solved effectively using very general-purpose forms of regularization.

5.3 Hyperparameters and Validation Sets

Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another learning algorithm).

In the polynomial regression example we saw in Fig. 5.2, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.

Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize. More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always
choose the maximum possible model capacity, resulting in overfitting (refer to Fig. 5.3). For example, we can always fit the training set better with a higher degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial and a positive weight decay setting.

To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is used to “train” the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error. After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.

In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
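A minimal sketch of the 80/20 split described above (my own illustration, assuming the NumPy conventions used earlier); shuffling before splitting keeps the two subsets identically distributed:

import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    """Shuffle the examples, then hold out the given fraction as the
    validation set used to select hyperparameters."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_valid = int(len(X) * validation_fraction)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]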
5.3.1 Cross-Validation

Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not a serious issue. When the dataset is too small, there are alternative procedures, which allow one to use all of the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset. The most common of these is the k-fold cross-validation procedure, shown in Algorithm 5.1, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials. On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set. One problem is that there exist no unbiased estimators of the variance of such average error estimators (Bengio and Grandvalet, 2004), but approximations are typically used.

5.4 Estimators, Bias and Variance

The field of statistics gives us many tools that can be used to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.

5.4.1 Point Estimation

Point estimation is the attempt to provide the single “best” prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example in Sec. 5.1.4, but it can also be a whole function.

In order to distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter θ by θ̂.

Let {x^(1), . . . , x^(m)} be a set of m independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:

θ̂_m = g(x^(1), . . . , x^(m)).    (5.19)

The definition does not require that g return a value that is close to the true θ or even that the range of g is the same as the set of allowable values of θ. This definition of a point estimator is very general and allows the designer of an estimator great flexibility. While almost any function thus qualifies as an estimator, a good estimator is a function whose output is close to the true underlying θ that generated the training data.
Algorithm 5.1 The k-fold cross-validation algorithm. It can be used to estimate generalization error of a learning algorithm A when the given dataset D is too small for a simple train/test or train/valid split to yield accurate estimation of generalization error, because the mean of a loss L on a small test set may have too high variance. The dataset D contains as elements the abstract examples z^{(i)} (for the i-th example), which could stand for an (input, target) pair z^{(i)} = (x^{(i)}, y^{(i)}) in the case of supervised learning, or for just an input z^{(i)} = x^{(i)} in the case of unsupervised learning. The algorithm returns the vector of errors e for each example in D, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (Eq. 5.47). While these confidence intervals are not well-justified after the use of cross-validation, it is still common practice to use them to declare that algorithm A is better than algorithm B only if the confidence interval of the error of algorithm A lies below and does not intersect the confidence interval of algorithm B.

Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z^{(i)}
Require: A, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a learned function f and an example z^{(i)} ∈ D to a scalar ∈ ℝ
Require: k, the number of folds
    Split D into k mutually exclusive subsets D_i, whose union is D.
    for i from 1 to k do
        f_i = A(D \ D_i)
        for z^{(j)} in D_i do
            e_j = L(f_i, z^{(j)})
        end for
    end for
    Return e
For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore θ̂ is a random variable.

Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.
As we mentioned above, sometimes we are interested in performing function estimation (or function approximation). Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space. The linear regression example (discussed above in Sec. 5.1.4) and the polynomial regression example (discussed in Sec. 5.2) are both examples of scenarios that may be interpreted either as estimating a parameter w or estimating a function f̂ mapping from x to y.

We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
5.4.2 Bias

The bias of an estimator is defined as:

    \text{bias}(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta    (5.20)

where the expectation is over the data (seen as samples from a random variable) and θ is the true underlying value of θ used to define the data generating distribution. An estimator θ̂_m is said to be unbiased if bias(θ̂_m) = 0, which implies that E(θ̂_m) = θ. An estimator θ̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, which implies that lim_{m→∞} E(θ̂_m) = θ.
Consider a set of samples {x^{(1)}, . . . , x^{(m)}} that are independently and identically distributed according to a Bernoulli distribution with mean θ:

    P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}.    (5.21)

A common estimator for the θ parameter of this distribution is the mean of the training samples:

    \hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}.    (5.22)

To determine whether this estimator is biased, we can substitute Eq. 5.22 into Eq. 5.20:

    \text{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta    (5.23)
    = \mathbb{E}\Big[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\Big] - \theta    (5.24)
    = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\big[x^{(i)}\big] - \theta    (5.25)
    = \frac{1}{m} \sum_{i=1}^{m} \sum_{x^{(i)}=0}^{1} \big(x^{(i)} \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}\big) - \theta    (5.26)
    = \frac{1}{m} \sum_{i=1}^{m} (\theta) - \theta    (5.27)
    = \theta - \theta = 0    (5.28)

Since bias(θ̂) = 0, we say that our estimator θ̂ is unbiased.
Now, consider a set of samples {x^{(1)}, . . . , x^{(m)}} that are independently and identically distributed according to a Gaussian distribution p(x^{(i)}) = N(x^{(i)}; µ, σ²), where i ∈ {1, . . . , m}. Recall that the Gaussian probability density function is given by

    p(x^{(i)}; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x^{(i)} - \mu)^2}{2\sigma^2}\Big).    (5.29)

A common estimator of the Gaussian mean parameter is known as the sample mean:

    \hat{\mu}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}    (5.30)

To determine the bias of the sample mean, we are again interested in calculating its expectation:

    \text{bias}(\hat{\mu}_m) = \mathbb{E}[\hat{\mu}_m] - \mu    (5.31)
    = \mathbb{E}\Big[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\Big] - \mu    (5.32)
    = \Big(\frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\big[x^{(i)}\big]\Big) - \mu    (5.33)
    = \Big(\frac{1}{m} \sum_{i=1}^{m} \mu\Big) - \mu    (5.34)
    = \mu - \mu = 0    (5.35)

Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.
As an example, we compare two different estimators of the variance parameter σ² of a Gaussian distribution. We are interested in knowing if either estimator is biased.

The first estimator of σ² we consider is known as the sample variance:

    \hat{\sigma}_m^2 = \frac{1}{m} \sum_{i=1}^{m} \big(x^{(i)} - \hat{\mu}_m\big)^2,    (5.36)

where µ̂_m is the sample mean, defined above. More formally, we are interested in computing

    \text{bias}(\hat{\sigma}_m^2) = \mathbb{E}[\hat{\sigma}_m^2] - \sigma^2    (5.37)

We begin by evaluating the term E[σ̂²_m]:

    \mathbb{E}[\hat{\sigma}_m^2] = \mathbb{E}\Big[\frac{1}{m} \sum_{i=1}^{m} \big(x^{(i)} - \hat{\mu}_m\big)^2\Big]    (5.38)
    = \frac{m-1}{m} \sigma^2    (5.39)

Returning to Eq. 5.37, we conclude that the bias of σ̂²_m is −σ²/m. Therefore, the sample variance is a biased estimator.

The unbiased sample variance estimator

    \tilde{\sigma}_m^2 = \frac{1}{m-1} \sum_{i=1}^{m} \big(x^{(i)} - \hat{\mu}_m\big)^2    (5.40)

provides an alternative approach. As the name suggests this estimator is unbiased. That is, we find that E[σ̃²_m] = σ²:

    \mathbb{E}[\tilde{\sigma}_m^2] = \mathbb{E}\Big[\frac{1}{m-1} \sum_{i=1}^{m} \big(x^{(i)} - \hat{\mu}_m\big)^2\Big]    (5.41)
    = \frac{m}{m-1} \mathbb{E}[\hat{\sigma}_m^2]    (5.42)
    = \frac{m}{m-1} \Big(\frac{m-1}{m} \sigma^2\Big)    (5.43)
    = \sigma^2.    (5.44)

We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the “best” estimators. As we will see we often use biased estimators that possess other important properties.
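
This bias is easy to check numerically; the sketch below assumes NumPy, whose var function exposes both normalizations through its ddof argument (ddof=0 gives Eq. 5.36, ddof=1 gives Eq. 5.40):

import numpy as np

rng = np.random.default_rng(0)
m, sigma2, trials = 10, 4.0, 100000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))

biased = samples.var(axis=1, ddof=0).mean()    # sample variance, Eq. 5.36
unbiased = samples.var(axis=1, ddof=1).mean()  # unbiased estimator, Eq. 5.40

print(biased)    # close to (m - 1)/m * sigma2 = 3.6, as Eq. 5.39 predicts
print(unbiased)  # close to sigma2 = 4.0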
5.4.3 Variance and Standard Error

Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance

    \text{Var}(\hat{\theta})    (5.45)

where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted SE(θ̂).

The variance or the standard error of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data generating process. Just as we might like an estimator to exhibit low bias we would also like it to have relatively low variance.

When we compute any statistic using a finite number of samples, our estimate of the true underlying parameter is uncertain, in the sense that we could have obtained other samples from the same distribution and their statistics would have been different. The expected degree of variation in any estimator is a source of error that we want to quantify.

The standard error of the mean is given by

    \text{SE}(\hat{\mu}_m) = \sqrt{\text{Var}\Big[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\Big]} = \frac{\sigma}{\sqrt{m}},    (5.46)
where σ² is the true variance of the samples x^{(i)}. The standard error is often estimated by using an estimate of σ. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provide an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation, but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large m, the approximation is quite reasonable.

The standard error of the mean is very useful in machine learning experiments. We often estimate the generalization error by computing the sample mean of the error on the test set. The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the central limit theorem, which tells us that the mean will be approximately distributed with a normal distribution, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95% confidence interval centered on the mean µ̂_m is

    \big(\hat{\mu}_m - 1.96\,\text{SE}(\hat{\mu}_m),\ \hat{\mu}_m + 1.96\,\text{SE}(\hat{\mu}_m)\big),    (5.47)

under the normal distribution with mean µ̂_m and variance SE(µ̂_m)². In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is less than the lower bound of the 95% confidence interval for the error of algorithm B.
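
For example, a minimal sketch of this comparison, where errors_a and errors_b are hypothetical vectors holding one test-set loss per example for each algorithm:

import numpy as np

def mean_and_ci(errors):
    m = len(errors)
    mu_hat = errors.mean()
    # Estimate SE(mu_hat) = sigma / sqrt(m) from the data (Eq. 5.46),
    # using the square root of the unbiased variance estimator.
    se = errors.std(ddof=1) / np.sqrt(m)
    return mu_hat, mu_hat - 1.96 * se, mu_hat + 1.96 * se  # Eq. 5.47

def a_is_better(errors_a, errors_b):
    # A is declared better only if A's interval lies entirely below B's.
    _, _, upper_a = mean_and_ci(errors_a)
    _, lower_b, _ = mean_and_ci(errors_b)
    return upper_a < lower_b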
We once again consider a set of samples {x^{(1)}, . . . , x^{(m)}} drawn independently and identically from a Bernoulli distribution (recall P(x^{(i)}; θ) = θ^{x^{(i)}}(1 − θ)^{(1 − x^{(i)})}). This time we are interested in computing the variance of the estimator θ̂_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}.

    \text{Var}\big(\hat{\theta}_m\big) = \text{Var}\Big(\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\Big)    (5.48)
    = \frac{1}{m^2} \sum_{i=1}^{m} \text{Var}\big(x^{(i)}\big)    (5.49)
    = \frac{1}{m^2} \sum_{i=1}^{m} \theta(1 - \theta)    (5.50)
    = \frac{1}{m^2}\, m\,\theta(1 - \theta)    (5.51)
    = \frac{1}{m} \theta(1 - \theta)    (5.52)

The variance of the estimator decreases as a function of m, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency (see Sec. 5.4.5).
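
Both conclusions for this estimator, zero bias (Eq. 5.28) and variance θ(1 − θ)/m (Eq. 5.52), can be confirmed by simulation; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
theta, m, trials = 0.3, 20, 200000
x = rng.binomial(1, theta, size=(trials, m))
theta_hat = x.mean(axis=1)                       # Eq. 5.22, computed once per trial

print(theta_hat.mean() - theta)                  # bias: close to 0
print(theta_hat.var(), theta * (1 - theta) / m)  # both close to 0.0105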
5.4.4 Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in Fig. 5.2 and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?

The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:

    \text{MSE} = \mathbb{E}[(\hat{\theta}_m - \theta)^2]    (5.53)
    = \text{Bias}(\hat{\theta}_m)^2 + \text{Var}(\hat{\theta}_m)    (5.54)

The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter θ. As is clear from Eq. 5.54, evaluating the MSE incorporates both the bias and the variance. Desirable estimators are those with small MSE and these are estimators that manage to keep both their bias and variance somewhat in check.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in Fig. 5.6, where we see again the U-shaped curve of generalization error as a function of capacity.
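
The identity in Eq. 5.54 can also be verified numerically; a sketch that treats the biased sample variance of Sec. 5.4.2 as the estimator under study:

import numpy as np

rng = np.random.default_rng(0)
m, sigma2, trials = 10, 4.0, 200000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
est = samples.var(axis=1, ddof=0)   # biased sample variance, Eq. 5.36

mse = ((est - sigma2) ** 2).mean()  # Eq. 5.53, estimated by Monte Carlo
decomposed = (est.mean() - sigma2) ** 2 + est.var()
print(mse, decomposed)              # the two quantities agree (Eq. 5.54)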
5.4.5 Consistency

So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points m in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that

    \hat{\theta}_m \overset{p}{\to} \theta.    (5.55)

The symbol \overset{p}{\to} means that the convergence is in probability, i.e. for any ε > 0, P(|θ̂_m − θ| > ε) → 0 as m → ∞. The condition described by Eq. 5.55 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of θ̂ to θ. Almost sure convergence of a sequence of random variables x^{(1)}, x^{(2)}, . . . to a value x occurs when p(lim_{m→∞} x^{(m)} = x) = 1.

Consistency ensures that the bias induced by the estimator is assured to diminish as the number of data examples grows. However, the reverse is not true: asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter µ of a normal distribution N(x; µ, σ²), with a dataset consisting of m samples: {x^{(1)}, . . . , x^{(m)}}. We could use the first sample x^{(1)} of the dataset as an unbiased estimator: θ̂ = x^{(1)}. In that case, E(θ̂_m) = θ so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator as it is not the case that θ̂_m → θ as m → ∞.
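
The contrast between these two estimators is easy to see numerically; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
for m in (10, 1000, 100000):
    x = rng.normal(mu, 1.0, size=m)
    # x[0] is unbiased but not consistent: its error does not shrink with m.
    # The sample mean is consistent: its error goes to zero as m grows.
    print(m, abs(x[0] - mu), abs(x.mean() - mu))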
5.5 Maximum Likelihood Estimation

Previously, we have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

The most common such principle is the maximum likelihood principle.

Consider a set of m examples X = {x^{(1)}, . . . , x^{(m)}} drawn independently from the true but unknown data generating distribution p_data(x).

Let p_model(x; θ) be a parametric family of probability distributions over the same space indexed by θ. In other words, p_model(x; θ) maps any configuration x to a real number estimating the true probability p_data(x).

The maximum likelihood estimator for θ is then defined as

    \theta_{\text{ML}} = \arg\max_{\theta}\, p_{\text{model}}(\mathbb{X}; \theta)    (5.56)
    = \arg\max_{\theta} \prod_{i=1}^{m} p_{\text{model}}(x^{(i)}; \theta)    (5.57)

This product over many probabilities can be inconvenient for a variety of reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum:

    \theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}; \theta).    (5.58)

Because the arg max does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution p̂_data defined by the training data:

    \theta_{\text{ML}} = \arg\max_{\theta}\, \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x; \theta).    (5.59)
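
As a concrete instance, here is a sketch of maximizing Eq. 5.58 numerically for a Gaussian model family, assuming SciPy is available for the optimizer; the result can be checked against the known closed-form Gaussian maximum likelihood estimates (the sample mean, and the biased sample variance of Eq. 5.36):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=1000)

def nll(params):
    # Negative of Eq. 5.58 for a Gaussian p_model(x; mu, sigma^2),
    # parametrized by log sigma so the optimizer is unconstrained.
    mu, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

result = minimize(nll, x0=np.array([0.0, 0.0]))
mu_ml, sigma2_ml = result.x[0], np.exp(2.0 * result.x[1])
print(mu_ml, x.mean())           # both near 3.0
print(sigma2_ml, x.var(ddof=0))  # both near 4.0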
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution p̂_data defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by

    D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}) = \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}} \big[\log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x)\big].    (5.60)

The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize

    -\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}} \big[\log p_{\text{model}}(x)\big]    (5.61)

which is of course the same as the maximization in Eq. 5.59.

Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution p̂_data. Ideally, we would like to match the true data generating distribution p_data, but we have no direct access to this distribution.

While the optimal θ is the same regardless of whether we are maximizing the likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross-entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when x is real-valued.
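
A small sketch of this correspondence for a Bernoulli model: the per-example negative log-likelihood of a dataset is exactly the cross-entropy between the empirical distribution p̂_data over {0, 1} and p_model:

import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # binary data
theta = 0.6                             # a candidate Bernoulli model

nll = -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

p_hat = np.array([1 - x.mean(), x.mean()])  # empirical distribution over {0, 1}
p_model = np.array([1 - theta, theta])
cross_entropy = -np.sum(p_hat * np.log(p_model))

print(nll, cross_entropy)               # identical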
5.5.1 Conditional Log-Likelihood and Mean Squared Error

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x; θ) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning. If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is

    \theta_{\text{ML}} = \arg\max_{\theta}\, P(\mathbf{Y} \mid \mathbf{X}; \theta).    (5.62)

If the examples are assumed to be i.i.d., then this can be decomposed into

    \theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta).    (5.63)

Linear regression, introduced earlier in Sec. 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input x and produce an output value ŷ. The mapping from x to ŷ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction ŷ, we now think of the model as producing a conditional distribution p(y | x). We can imagine that with an infinitely large training set, we might see several training examples with the same input value x but different values of y. The goal of the learning algorithm is now to fit the distribution p(y | x) to all of those different y values that are all compatible with x. To derive the same linear regression algorithm we obtained before, we define p(y | x) = N(y; ŷ(x; w), σ²). The function ŷ(x; w) gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant σ² chosen by the user. We will see that this choice of the functional form of p(y | x) causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (Eq. 5.63) is given by

    \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta)    (5.64)
    = -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\|\hat{y}^{(i)} - y^{(i)}\|^2}{2\sigma^2}    (5.65)
where ŷ^{(i)} is the output of the linear regression on the i-th input x^{(i)} and m is the number of the training examples. Comparing the log-likelihood with the mean squared error,

    \text{MSE}_{\text{train}} = \frac{1}{m} \sum_{i=1}^{m} \|\hat{y}^{(i)} - y^{(i)}\|^2,    (5.66)

we immediately see that maximizing the log-likelihood with respect to w yields the same estimate of the parameters w as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
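
A sketch confirming the shared optimum: solve for w by minimizing the MSE in closed form (the normal equations), then observe that the gradient of the log-likelihood in Eq. 5.65 (with σ = 1), which is proportional to X^⊤(y − Xw), vanishes at the same point:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(size=100)

# Minimize the MSE of Eq. 5.66 via the normal equations.
w_mse = np.linalg.solve(X.T @ X, X.T @ y)

grad_loglik = X.T @ (y - X @ w_mse)  # gradient of Eq. 5.65 at w_mse, sigma = 1
print(w_mse)                         # close to w_true
print(grad_loglik)                   # numerically zero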
5.5.2 Properties of Maximum Likelihood

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples m → ∞, in terms of its rate of convergence as m increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see Sec. 5.4.5 above), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:

• The true distribution p_data must lie within the model family p_model(·; θ). Otherwise, no estimator can recover p_data.

• The true distribution p_data must correspond to exactly one value of θ. Otherwise, maximum likelihood can recover the correct p_data, but will not be able to determine which value of θ was used by the data generating process.

There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. However, consistent estimators can differ in their statistical efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples m, or equivalently, may require fewer examples to obtain a fixed level of generalization error.

Statistical efficiency is typically studied in the parametric case (like in linear regression) where our goal is to estimate the value of a parameter (and assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter values, where the expectation is over m training samples from the data generating distribution. That parametric mean squared error decreases as m increases, and for m large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.
5.6 Bayesian Statistics

So far we have discussed frequentist statistics and approaches based on estimating a single value of θ, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of θ when making a prediction. The latter is the domain of Bayesian statistics.

As discussed in Sec. 5.4.1, the frequentist perspective is that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a random variable on account of it being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter θ is unknown or uncertain and thus is represented as a random variable.

Before observing the data, we represent our knowledge of θ using the prior probability distribution, p(θ) (sometimes referred to as simply “the prior”). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e. with high entropy) to reflect a high degree of uncertainty in the value of θ before observing any data. For example, one might assume a priori that θ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for “simpler” solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).

Now consider that we have a set of data samples {x^{(1)}, . . . , x^{(m)}}. We can recover the effect of data on our belief about θ by combining the data likelihood p(x^{(1)}, . . . , x^{(m)} | θ) with the prior via Bayes' rule:

    p(\theta \mid x^{(1)}, \ldots, x^{(m)}) = \frac{p(x^{(1)}, \ldots, x^{(m)} \mid \theta)\, p(\theta)}{p(x^{(1)}, \ldots, x^{(m)})}    (5.67)
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.

Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of θ, the Bayesian approach is to make predictions using a full distribution over θ. For example, after observing m examples, the predicted distribution over the next data sample, x^{(m+1)}, is given by

    p(x^{(m+1)} \mid x^{(1)}, \ldots, x^{(m)}) = \int p(x^{(m+1)} \mid \theta)\, p(\theta \mid x^{(1)}, \ldots, x^{(m)})\, d\theta.    (5.68)

Here each value of θ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed {x^{(1)}, . . . , x^{(m)}}, if we are still quite uncertain about the value of θ, then this uncertainty is incorporated directly into any predictions we might make.
In Sec. 5.4, we discussed ho how w the frequen
frequentisttist approach addresses the uncertaint uncertainty y
might make.
in a giv en point estimate of θ by ev
given evaluating
aluating its variance. The variance of the
In Sec. 5.4 , w e discussed ho w
estimator is an assessment of how the estimate the frequen tist approach
might addresses
change with the uncertaint
alternativey
in a giv en p oint
samplings of the observ estimate
observed of θ
ed data. The Bab y ev aluating
Bay its variance. The
yesian answer to the question of ho v ariance
how of deal
w to the
estimator
with is an assessment
the uncertaint
uncertainty of how the
y in the estimator is toestimate
simply in might
integrate
tegrate change
over it,withwhic
which alternative
h tends to
samplings of the
protect well against ovobserv ed data.
overfitting. The Ba y esian answer to the question
erfitting. This integral is of course just an application of ho w to dealof
with the uncertaint
the laws of probabilit
probabilityy in the estimator
y, making the Ba is
Bay to simply
yesian approac
approachin tegrate o v er it,
h simple to justify whic h tends
justify,, while the to
protect
frequen tist machinery for constructing an estimator is based on the application
frequentistwell against ov erfitting. This integral is of course just an rather ad ho ofc
hoc
the laws to
decision of summarize
probability, all making
knowledgethe Bacon yesian
contained approac
tained in theh dataset
simple to withjustify , while
a single the
point
frequentist machinery for constructing an estimator is based on the rather ad ho c
estimate.
decision to summarize all knowledge contained in the dataset with a single point
The second important difference between the Bayesian approach to estimation and the maximum likelihood approach is due to the contribution of the Bayesian prior distribution. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or more smooth. Critics of the Bayesian approach identify the prior as a source of subjective human judgment impacting the predictions.

Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high computational cost when the number of training examples is large.
Example: Bayesian Linear Regression

Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector x \in \mathbb{R}^n to predict the value of a scalar y \in \mathbb{R}. The prediction is parametrized by the vector w \in \mathbb{R}^n:

\hat{y} = w^\top x.   (5.69)

Given a set of m training samples (X^{(\text{train})}, y^{(\text{train})}), we can express the prediction of y over the entire training set as

\hat{y}^{(\text{train})} = X^{(\text{train})} w.   (5.70)

Expressed as a Gaussian conditional distribution on y^{(\text{train})}, we have

p(y^{(\text{train})} \mid X^{(\text{train})}, w) = \mathcal{N}(y^{(\text{train})}; X^{(\text{train})} w, I)   (5.71)
\propto \exp\left(-\frac{1}{2}\left(y^{(\text{train})} - X^{(\text{train})} w\right)^\top \left(y^{(\text{train})} - X^{(\text{train})} w\right)\right),   (5.72)

where we follow the standard MSE formulation in assuming that the Gaussian variance on y is one. In what follows, to reduce the notational burden, we refer to (X^{(\text{train})}, y^{(\text{train})}) as simply (X, y).
To determine the posterior distribution over the model parameter vector w, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution expressing a high degree of uncertainty about θ. For real-valued parameters it is common to use a Gaussian as a prior distribution:

p(w) = \mathcal{N}(w; \mu_0, \Lambda_0) \propto \exp\left(-\frac{1}{2}(w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0)\right),   (5.73)

where \mu_0 and \Lambda_0 are the prior distribution mean vector and covariance matrix respectively. (Unless there is a reason to assume a particular covariance structure, we typically assume a diagonal covariance matrix.)

With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters:

p(w \mid X, y) \propto p(y \mid X, w)\, p(w)   (5.74)
\propto \exp\left(-\frac{1}{2}(y - Xw)^\top (y - Xw)\right) \exp\left(-\frac{1}{2}(w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0)\right)   (5.75)
\propto \exp\left(-\frac{1}{2}\left(-2 y^\top X w + w^\top X^\top X w + w^\top \Lambda_0^{-1} w - 2 \mu_0^\top \Lambda_0^{-1} w\right)\right).   (5.76)

We now define \Lambda_m = \left(X^\top X + \Lambda_0^{-1}\right)^{-1} and \mu_m = \Lambda_m \left(X^\top y + \Lambda_0^{-1} \mu_0\right). Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:

p(w \mid X, y) \propto \exp\left(-\frac{1}{2}(w - \mu_m)^\top \Lambda_m^{-1} (w - \mu_m) + \frac{1}{2} \mu_m^\top \Lambda_m^{-1} \mu_m\right)   (5.77)
\propto \exp\left(-\frac{1}{2}(w - \mu_m)^\top \Lambda_m^{-1} (w - \mu_m)\right).   (5.78)

All terms that do not include the parameter vector w have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1. Eq. 3.23 shows how to normalize a multivariate Gaussian distribution.
Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set \mu_0 to 0. If we set \Lambda_0 = \frac{1}{\alpha} I, then \mu_m gives the same estimate of w as does frequentist linear regression with a weight decay penalty of \alpha w^\top w. One difference is that the Bayesian estimate is undefined if \alpha is set to zero; we are not allowed to begin the Bayesian learning process with an infinitely wide prior on w. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate \mu_m.
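The posterior updates above are easy to compute directly. The following sketch (synthetic data and unit noise variance are assumptions matching Eq. 5.71; this is not code from the book) evaluates \mu_m and \Lambda_m and checks the weight decay correspondence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with unit-variance Gaussian noise (Eq. 5.71).
m, n = 50, 3
X = rng.normal(size=(m, n))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(size=m)

# Prior N(w; mu_0, Lambda_0) with mu_0 = 0 and Lambda_0 = (1/alpha) I,
# so Lambda_0^{-1} = alpha * I.
alpha = 1.0
mu_0 = np.zeros(n)
Lambda_0_inv = alpha * np.eye(n)

# Posterior parameters from Eqs. 5.77-5.78.
Lambda_m = np.linalg.inv(X.T @ X + Lambda_0_inv)
mu_m = Lambda_m @ (X.T @ y + Lambda_0_inv @ mu_0)

# mu_m coincides with the frequentist weight decay (ridge) estimate...
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)
assert np.allclose(mu_m, w_ridge)
# ...but the Bayesian answer also carries uncertainty, via Lambda_m.
print(mu_m, np.diag(Lambda_m))
```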
5.6.1 Maximum A Posteriori (MAP) Estimation

While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter θ, it is still often desirable to have a single point estimate. One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous θ):

\theta_{\text{MAP}} = \arg\max_\theta\, p(\theta \mid x) = \arg\max_\theta\, \log p(x \mid \theta) + \log p(\theta).   (5.79)
We recognize, above on the right hand side, \log p(x \mid \theta), i.e. the standard log-likelihood term, and \log p(\theta), corresponding to the prior distribution.

As an example, consider a linear regression model with a Gaussian prior on the weights w. If this prior is given by \mathcal{N}(w; 0, \frac{1}{\lambda} I), then the log-prior term in Eq. 5.79 is proportional to the familiar \lambda w^\top w weight decay penalty, plus a term that does not depend on w and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.
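A minimal numerical check of this correspondence (synthetic data is an assumption; this is an illustrative sketch, not code from the book): for the linear-Gaussian model, maximizing Eq. 5.79 amounts to minimizing the squared error plus the weight decay penalty, and the maximizer is the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=100)
lam = 0.1  # prior precision: p(w) = N(w; 0, (1/lam) I)

# Maximizing Eq. 5.79 here means minimizing
#   (1/2) ||y - X w||^2 + (lam/2) w^T w,
# whose unique minimizer is the weight decay estimate:
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_map)
```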
As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data. This additional information helps to reduce the variance in the MAP point estimate (in comparison to the ML estimate). However, it does so at the price of increased bias.
Many
pricey of
regularized
increased estimation
bias. strategies, such as maxim maximum um lik
likelihoo
elihoo
elihood d learning
regularized with weigh weightt deca
decay y, can be in interpreted
terpreted as making the MAP appro approxima-
xima-
Man
tion to Ba y
Bayregularized estimation strategies, such as maxim
yesian inference. This view applies when the regularization consists of um likelihoo d learning
regularized
adding an extra with term
weighto t deca
the yob, can
objectivebe in
jective terpreted
function thatas making the MAP
corresponds to logappro
p(θ ).xima-
Not
tion to Ba y esian inference.
all regularization penalties corresp This view
correspond applies
ond to MAP Ba whenBay the regularization consists
yesian inference. For example, of
addingregularizer
some an extra term terms tomaythe ob notjective
be thefunction
logarithmthat of corresponds
a probability to log p(θ ). Not
distribution.
all regularization
Other regularization penalties correspon
terms depend ondtheto data,
MAPwhich Bayesian inference.
of course a prior For example,
probability
some regularizer
distribution is notterms
allow
allowedmay
ed to not
do. be the logarithm of a probability distribution.
Other regularization terms depend on the data, which of course a prior probability
MAP Bay
distribution Bayesian
isesian inference
not allow ed topro
provides
vides a straigh
do. straightforw
tforw
tforwardard way to design complicated
yet ininterpretable
terpretable regularization terms. For example, a more complicated penalty
termMAP can bBay esian inference
e derived by using pro vides a of
a mixture straigh tforward
Gaussians, way than
rather to design
a singlecomplicated
Gaussian
ydistribution,
et interpretable regularization
as the prior (Nowlan and Hinterms. F or
Hintonexample,
ton, 1992). a more complicated penalty
term can be derived by using a mixture of Gaussians, rather than a single Gaussian
distribution, as the prior (Nowlan and Hinton, 1992).
5.7 Supervised Learning Algorithms

Recall from Sec. 5.1.3 that supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases the outputs y may be difficult to collect automatically and must be provided by a human “supervisor,” but the term still applies even when the training set targets were collected automatically.
5.7.1 Probabilistic Supervised Learning

Most supervised learning algorithms in this book are based on estimating a probability distribution p(y \mid x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y \mid x; \theta).

We have already seen that linear regression corresponds to the family

p(y \mid x; \theta) = \mathcal{N}(y; \theta^\top x, I).   (5.80)

We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.

The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:

p(y = 1 \mid x; \theta) = \sigma(\theta^\top x).   (5.81)

This approach is known as logistic regression (a somewhat strange name since we use the model for classification rather than regression).

In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult. There is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent.

This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
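For concreteness, here is a minimal sketch of this recipe for logistic regression (Eq. 5.81). The synthetic data, fixed step size, and full-batch gradient descent on the NLL are all assumptions made for illustration, not prescriptions from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class problem.
X = rng.normal(size=(200, 2))
y = (X @ np.array([3.0, -2.0]) + 0.5 * rng.normal(size=200) > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

theta = np.zeros(2)
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ theta)           # p(y = 1 | x; theta), Eq. 5.81
    grad = X.T @ (p - y) / len(y)    # gradient of the mean NLL
    theta -= lr * grad               # gradient descent step

print(theta)  # approximate maximum likelihood weights
```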
5.7.2 Support Vector Machines

One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function w^\top x + b. Unlike logistic regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when w^\top x + b is positive. Likewise, it predicts that the negative class is present when w^\top x + b is negative.
One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be re-written as

w^\top x + b = b + \sum_{i=1}^{m} \alpha_i x^\top x^{(i)},   (5.82)

where x^{(i)} is a training example and \alpha is a vector of coefficients. Rewriting the learning algorithm this way allows us to replace x by the output of a given feature function \phi(x) and the dot product with a function k(x, x^{(i)}) = \phi(x) \cdot \phi(x^{(i)}) called a kernel. The \cdot operator represents an inner product analogous to \phi(x)^\top \phi(x^{(i)}). For some feature spaces, we may not use literally the vector inner product. In some infinite dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book.

After replacing dot products with kernel evaluations, we can make predictions using the function

f(x) = b + \sum_i \alpha_i k(x, x^{(i)}).   (5.83)
This function is nonlinear with respect to x, but the relationship between \phi(x) and f(x) is linear. Also, the relationship between \alpha and f(x) is linear. The kernel-based function is exactly equivalent to preprocessing the data by applying \phi(x) to all inputs, then learning a linear model in the new transformed space.

The kernel trick is powerful for two reasons. First, it allows us to learn models that are nonlinear as a function of x using convex optimization techniques that are guaranteed to converge efficiently. This is possible because we consider \phi fixed and optimize only \alpha, i.e., the optimization algorithm can view the decision function as being linear in a different space. Second, the kernel function k often admits an implementation that is significantly more computationally efficient than naively constructing two \phi(x) vectors and explicitly taking their dot product.

In some cases, \phi(x) can even be infinite dimensional, which would result in an infinite computational cost for the naive, explicit approach. In many cases, k(x, x') is a nonlinear, tractable function of x even when \phi(x) is intractable.
As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping \phi(x) over the non-negative integers x. Suppose that this mapping returns a vector containing x ones followed by infinitely many zeros. We can write a kernel function k(x, x^{(i)}) = \min(x, x^{(i)}) that is exactly equivalent to the corresponding infinite-dimensional dot product.
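A quick sanity check of this equivalence (a sketch under the stated construction, truncating the conceptually infinite feature vectors at a finite length large enough not to matter for small inputs):

```python
import numpy as np

def phi(x, dim=100):
    """Feature map: x ones followed by (conceptually infinitely many) zeros."""
    v = np.zeros(dim)
    v[:x] = 1.0
    return v

# For non-negative integers below the truncation length, the explicit
# dot product phi(x) . phi(x') equals the kernel min(x, x').
for x, xp in [(3, 7), (5, 5), (0, 9)]:
    assert phi(x) @ phi(xp) == min(x, xp)
print("min kernel matches the explicit dot product")
```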
The most commonly used kernel is the Gaussian kernel

k(u, v) = \mathcal{N}(u - v; 0, \sigma^2 I),   (5.84)

where \mathcal{N}(x; \mu, \Sigma) is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in v space radiating outward from u. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.
We can think of the Gaussian kernel as performing a kind of template matching. A training example x associated with training label y becomes a template for class y. When a test point x' is near x according to Euclidean distance, the Gaussian kernel has a large response, indicating that x' is very similar to the x template. The model then puts a large weight on the associated training label y. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).
A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term \alpha_i k(x, x^{(i)}) to the decision function. Support vector machines are able to mitigate this by learning an \alpha vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have non-zero \alpha_i. These training examples are known as support vectors.
Kernel machines also suffer from a high computational cost of training when the dataset is large. We will revisit this idea in Sec. 5.9. Kernel machines with generic kernels struggle to generalize well. We will explain why in Sec. 5.11. The modern incarnation of deep learning was designed to overcome these limitations of kernel machines. The current deep learning renaissance began when Hinton et al. (2006) demonstrated that a neural network could outperform the RBF kernel SVM on the MNIST benchmark.
5.7.3 Other Simple Supervised Learning Algorithms

We have already briefly encountered another non-probabilistic supervised learning algorithm, nearest neighbor regression. More generally, k-nearest neighbors is a family of techniques that can be used for classification or regression. As a non-parametric learning algorithm, k-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the k-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data. In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output y for a new test input x, we find the k-nearest neighbors to x in the training data X. We then return the average of the corresponding y values in the training set. This works for essentially any kind of supervised learning where we can define an average over y values. In the case of classification, we can average over one-hot code vectors c with c_y = 1 and c_i = 0 for all other values of i. We can then interpret the average over these one-hot codes as giving a probability distribution over classes.

As a non-parametric learning algorithm, k-nearest neighbors can achieve very high capacity. For example, suppose we have a multiclass classification task and measure performance with 0-1 loss. In this setting, 1-nearest neighbor converges to double the Bayes error as the number of training examples approaches infinity. The error in excess of the Bayes error results from choosing a single neighbor by breaking ties between equally distant neighbors randomly. When there is infinite training data, all test points x will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all of these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate.

The high capacity of k-nearest neighbors allows it to obtain high accuracy given a large training set. However, it does so at high computational cost, and it may generalize very badly given a small, finite training set. One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with x \in \mathbb{R}^{100} drawn from an isotropic Gaussian distribution, but only a single variable x_1 is relevant to the output. Suppose further that this feature simply encodes the output directly, i.e. that y = x_1 in all cases. Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features x_2 through x_{100}, not by the lone feature x_1. Thus the output on small training sets will essentially be random.
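Because there is no training stage, the whole method fits in a few lines. This sketch (synthetic data; a plain k-NN regressor, not the book's code) also reproduces the failure mode just described, where 99 irrelevant features drown out x_1:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_regress(x, X_train, y_train, k=3):
    """Average the y values of the k nearest training points."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

# y = x_1 exactly, but x has 99 additional irrelevant dimensions.
X_train = rng.normal(size=(50, 100))
y_train = X_train[:, 0]

x_test = rng.normal(size=100)
print(knn_regress(x_test, X_train, y_train), "vs true", x_test[0])
# With a small training set the prediction is close to random, because
# distances are dominated by the 99 irrelevant features.
```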
Another type of learning algorithm that also breaks the input space into regions and has separate parameters for each region is the decision tree (Breiman et al., 1984) and its many variants. As shown in Fig. 5.7, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one sub-region for each child of the node (typically using an axis-aligned cut). Space is thus sub-divided into non-overlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Decision trees are usually trained with specialized algorithms that are beyond the scope of this book. The learning algorithm can be considered non-parametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice. Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression. For example, if we have a two-class problem and the positive class occurs wherever x_2 > x_1, the decision boundary is not axis-aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.
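A small experiment can make this limitation tangible. The sketch below assumes scikit-learn is available (a convenience not prescribed by the book) and compares a shallow tree to logistic regression on the x_2 > x_1 problem just described:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two-class problem whose true boundary x2 > x1 is not axis-aligned.
X = rng.normal(size=(1000, 2))
y = (X[:, 1] > X[:, 0]).astype(int)
X_test = rng.normal(size=(1000, 2))
y_test = (X_test[:, 1] > X_test[:, 0]).astype(int)

# Logistic regression recovers the oblique boundary almost exactly, while
# a small tree must staircase across it with axis-aligned cuts.
print(LogisticRegression().fit(X, y).score(X_test, y_test))
print(DecisionTreeClassifier(max_depth=4).fit(X, y).score(X_test, y_test))
```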
As we have seen, nearest neighbor predictors and decision trees have many limitations. Nonetheless, they are useful learning algorithms when computational resources are constrained. We can also build intuition for more sophisticated learning algorithms by thinking about the similarities and differences between sophisticated algorithms and k-NN or decision tree baselines.

See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine learning textbooks for more material on traditional supervised learning algorithms.
5.8 Unsupervised Learning Algorithms

Recall from Sec. 5.1.3 that unsupervised algorithms are those that experience only “features” but not a supervision signal. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined because there is no objective test for distinguishing whether a value is a feature or a target provided by a supervisor. Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples. The term is usually associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near, or clustering the data into groups of related examples.
A classic unsupervised learning task is to find the “best” representation of the data. By ‘best’ we can mean different things, but generally speaking we are looking for a representation that preserves as much information about x as possible while obeying some penalty or constraint aimed at keeping the representation simpler or more accessible than x itself.

There are multiple ways of defining a simpler representation. Three of the most common include lower dimensional representations, sparse representations and independent representations. Low-dimensional representations attempt to compress as much information about x as possible in a smaller representation. Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and Ghahramani, 1997) embed the dataset into a representation whose entries are mostly zeroes for most inputs. The use of sparse representations typically requires increasing the dimensionality of the representation, so that the representation becoming mostly zeroes does not discard too much information. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space. Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.
Of course these three criteria are certainly not mutually exclusive. Low-dimensional representations often yield elements that have fewer or weaker dependencies than the original high-dimensional data. This is because one way to reduce the size of a representation is to find and remove redundancies. Identifying and removing more redundancy allows the dimensionality reduction algorithm to achieve more compression while discarding less information.

The notion of representation is one of the central themes of deep learning and therefore one of the central themes in this book. In this section, we develop some simple examples of representation learning algorithms. Together, these example algorithms show how to operationalize all three of the criteria above. Most of the remaining chapters introduce additional representation learning algorithms that develop these criteria in different ways or introduce other criteria.
5.8.1 Principal Components Analysis

In Sec. 2.12, we saw that the principal components analysis algorithm provides a means of compressing data. We can also view PCA as an unsupervised learning algorithm that learns a representation of data. This representation is based on two of the criteria for a simple representation described above.
[Figure 5.8: illustration of the linear projection z = x^\top W learned by PCA.]
PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other. This is a first step toward the criterion of learning representations whose elements are statistically independent. To achieve full independence, a representation learning algorithm must also remove the nonlinear relationships between variables.

PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z, as shown in Fig. 5.8. In Sec. 2.12, we saw that we could learn a one-dimensional representation that best reconstructs the original data (in the sense of mean squared error) and that this representation actually corresponds to the first principal component of the data. Thus we can use PCA as a simple and effective dimensionality reduction method that preserves as much of the information in the data as possible (again, as measured by least-squares reconstruction error). In the following, we will study how the PCA representation decorrelates the original data representation X.
Let us consider the m \times n-dimensional design matrix X. We will assume that the data has a mean of zero, \mathbb{E}[x] = 0. If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.

The unbiased sample covariance matrix associated with X is given by:

\mathrm{Var}[x] = \frac{1}{m-1} X^\top X.   (5.85)
PCA finds a representation (through linear transformation) z = x^\top W where \mathrm{Var}[z] is diagonal.

In Sec. 2.12, we saw that the principal components of a design matrix X are given by the eigenvectors of X^\top X. From this view,

X^\top X = W \Lambda W^\top.   (5.86)

In this section, we exploit an alternative derivation of the principal components. The principal components may also be obtained via the singular value decomposition. Specifically, they are the right singular vectors of X. To see this, let W be the right singular vectors in the decomposition X = U \Sigma W^\top. We then recover the original eigenvector equation with W as the eigenvector basis:

X^\top X = \left(U \Sigma W^\top\right)^\top U \Sigma W^\top = W \Sigma^2 W^\top.   (5.87)
The SVD is helpful to show that PCA results in a diagonal \mathrm{Var}[z]. Using the SVD of X, we can express the variance of X as:

\mathrm{Var}[x] = \frac{1}{m-1} X^\top X   (5.88)
= \frac{1}{m-1} \left(U \Sigma W^\top\right)^\top U \Sigma W^\top   (5.89)
= \frac{1}{m-1} W \Sigma^\top U^\top U \Sigma W^\top   (5.90)
= \frac{1}{m-1} W \Sigma^2 W^\top,   (5.91)

where we use the fact that U^\top U = I because the U matrix of the singular value decomposition is defined to be orthonormal. This shows that if we take z = x^\top W, we can ensure that the covariance of z is diagonal as required:

\mathrm{Var}[z] = \frac{1}{m-1} Z^\top Z   (5.92)
= \frac{1}{m-1} W^\top X^\top X W   (5.93)
= \frac{1}{m-1} W^\top W \Sigma^2 W^\top W   (5.94)
= \frac{1}{m-1} \Sigma^2,   (5.95)

where this time we use the fact that W^\top W = I, again from the definition of the SVD.
The ab
abov
ov
ovee analysis shows that when we pro ject the data x to z, via the linear
project
transformation W , the resulting representation has a diagonal co covvariance matrix
The
(as giv
givenab ov e 2analysis
en by Σ ) whicwhich shows that when we pro ject the data
h immediately implies that the individual elemenx to z, via ts
elements theoflinear
z are
transformation W
mutually uncorrelated., the resulting representation has a diagonal co variance matrix
(as given by Σ ) which immediately implies that the individual elements of z are
This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the unknown factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (described by W) that aligns the principal axes of variance with the basis of the new representation space associated with z.

While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation.
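The derivation above is short enough to verify numerically. The following minimal numpy sketch (our own illustration; the randomly generated data and variable names are arbitrary choices) computes the principal components as the right singular vectors of the centered design matrix and checks that the covariance of z = x^⊤W is diagonal, as in Eq. 5.95:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated data

    Xc = X - X.mean(axis=0)                    # PCA assumes zero-mean data
    U, s, Wt = np.linalg.svd(Xc, full_matrices=False)
    W = Wt.T                                   # columns are the principal components

    Z = Xc @ W                                 # project: z = x^T W
    cov_z = Z.T @ Z / (len(X) - 1)             # Var[z], Eq. 5.92

    # The off-diagonal entries vanish up to floating point error (Eq. 5.95).
    assert np.allclose(cov_z, np.diag(s**2 / (len(X) - 1)))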
5.8.2 k-means Clustering

Another example of a simple representation learning algorithm is k-means clustering. The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other. We can thus think of the algorithm as providing a k-dimensional one-hot code vector h representing an input x. If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.

The one-hot code provided by k-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Later, we will develop other algorithms that learn more flexible sparse representations, where more than one entry can be non-zero for each input x. One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation. The one-hot code still confers some statistical advantages (it naturally conveys the idea that all examples in the same cluster are similar to each other) and it confers the computational advantage that the entire representation may be captured by a single integer.

The k-means algorithm works by initializing k different centroids {µ^(1), . . . , µ^(k)} to different values, then alternating between two different steps until convergence. In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid µ^(i). In the other step, each centroid µ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
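Both steps translate directly into array operations. Here is a minimal numpy sketch of the algorithm (our own illustration; the fixed iteration count stands in for a convergence test, and initializing the centroids to random training examples is one common choice among several):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Cluster the rows of X into k groups with the two k-means steps."""
        rng = np.random.default_rng(seed)
        # Initialize the k centroids to k distinct training examples.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: index of the nearest centroid for each example.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its assigned examples.
            for i in range(k):
                if np.any(assign == i):
                    centroids[i] = X[assign == i].mean(axis=0)
        return centroids, assign

    X = np.random.default_rng(1).normal(size=(200, 2))
    centroids, assign = kmeans(X, k=3)
    h = np.eye(3)[assign[0]]    # one-hot code h for the first training example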
One difficulty pertaining to clustering is that the clustering problem is inherently ill-posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world. We can measure properties of the clustering, such as the average Euclidean distance from a cluster centroid to the members of the cluster. This allows us to tell how well we are able to reconstruct the training data from the cluster assignments. We do not know how well the cluster assignments correspond to properties of the real world. Moreover, there may be many different clusterings that all correspond well to some property of the real world. We may hope to find a clustering that relates to one feature but obtain a different, equally valid clustering that is not relevant to our task. For example, suppose that we run two clustering algorithms on a dataset consisting of images of red trucks, images of red cars, images of gray trucks, and images of gray cars. If we ask each clustering algorithm to find two clusters, one algorithm may find a cluster of cars and a cluster of trucks, while another may find a cluster of red vehicles and a cluster of gray vehicles. Suppose we also run a third clustering algorithm, which is allowed to determine the number of clusters. This may assign the examples to four clusters: red cars, red trucks, gray cars, and gray trucks. This new clustering now at least captures information about both attributes, but it has lost information about similarity. Red cars are in a different cluster from gray cars, just as they are in a different cluster from gray trucks. The output of the clustering algorithm does not tell us that red cars are more similar to gray cars than they are to gray trucks. They are different from both things, and that is all we know.

These issues illustrate some of the reasons that we may prefer a distributed representation to a one-hot representation. A distributed representation could have two attributes for each vehicle: one representing its color and one representing whether it is a car or a truck. It is still not entirely clear what the optimal distributed representation is (how can the learning algorithm know whether the two attributes we are interested in are color and car-versus-truck rather than manufacturer and age?) but having many attributes reduces the burden on the algorithm to guess which single attribute we care about, and allows us to measure similarity between objects in a fine-grained way by comparing many attributes instead of just testing whether one attribute matches.
5.9 Stochastic Gradient Descent

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD. Stochastic gradient descent is an extension of the gradient descent algorithm introduced in Sec. 4.3.

A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

    J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^{m} L(x^(i), y^(i), θ),    (5.96)

where L is the per-example loss L(x, y, θ) = −log p(y | x; θ).

For these additive cost functions, gradient descent requires computing

    ∇_θ J(θ) = (1/m) Σ_{i=1}^{m} ∇_θ L(x^(i), y^(i), θ).                        (5.97)

The computational cost of this operation is O(m). As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.

The insight of stochastic gradient descent is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x^(1), . . . , x^(m′)} drawn uniformly from the training set. The minibatch size m′ is typically chosen to be a relatively small number of examples, ranging from 1 to a few hundred. Crucially, m′ is usually held fixed as the training set size m grows. We may fit a training set with billions of examples using updates computed on only a hundred examples.

The estimate of the gradient is formed as

    g = (1/m′) Σ_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)                              (5.98)

using examples from the minibatch B. The stochastic gradient descent algorithm then follows the estimated gradient downhill:

    θ ← θ − εg,                                                                 (5.99)

where ε is the learning rate.
Gradient descent in general has often been regarded as slow or unreliable. In the past, the application of gradient descent to non-convex optimization problems was regarded as foolhardy or unprincipled. Today, we know that the machine learning models described in Part II work very well when trained with gradient descent. The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.

Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets. For a fixed model size, the cost per SGD update does not depend on the training set size m. In practice, we often use a larger model as the training set size increases, but we are not forced to do so. The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model's best possible test error. From this point of view, one can argue that the asymptotic cost of training a model with SGD is O(1) as a function of m.

Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an m × m matrix G_{i,j} = k(x^(i), x^(j)). Constructing this matrix has computational cost O(m²), which is clearly undesirable for datasets with billions of examples. In academia, starting in 2006, deep learning was initially interesting because it was able to generalize to new examples better than competing algorithms when trained on medium-sized datasets with tens of thousands of examples. Soon after, deep learning garnered additional interest in industry, because it provided a scalable way of training nonlinear models on large datasets.

Stochastic gradient descent and many enhancements to it are described further in Chapter 8.
5.10 Building a Machine Learning Algorithm

Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.

For example, the linear regression algorithm combines a dataset consisting of X and y, the cost function

    J(w, b) = −E_{x,y∼p̂_data} log p_model(y | x),                              (5.100)

the model specification p_model(y | x) = N(y; x^⊤w + b, 1), and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.

By realizing that we can replace any of these components mostly independently from the others, we can obtain a very wide variety of algorithms.
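To make the recipe concrete, here is a minimal sketch (our own illustration) that instantiates each component for linear regression and solves the normal equations directly. The lam term anticipates the weight decay variant discussed below; with lam=0 this is plain maximum likelihood, and the exact scaling convention for lam is a choice we make here:

    import numpy as np

    def fit_linear_regression(X, y, lam=0.0):
        """Recipe: dataset (X, y), negative log-likelihood cost under the
        model y ~ N(x^T w + b, 1), optimizer = the normal equations."""
        Xb = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias b
        # Setting the gradient of the (weight-decayed) cost to zero gives
        # (Xb^T Xb + lam * I) theta = Xb^T y.
        reg = lam * np.eye(Xb.shape[1])
        reg[-1, -1] = 0.0                           # by convention, do not decay b
        theta = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
        return theta[:-1], theta[-1]                # weights w and bias b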
The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.

The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain

    J(w, b) = λ||w||₂² − E_{x,y∼p̂_data} log p_model(y | x).                    (5.101)

This still allows closed-form optimization.

If we change the model to be nonlinear, then most cost functions can no longer be optimized in closed form. This requires us to choose an iterative numerical optimization procedure, such as gradient descent.

The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning. Unsupervised learning can be supported by defining a dataset that contains only X and providing an appropriate unsupervised cost and model. For example, we can obtain the first PCA vector by specifying that our loss function is

    J(w) = E_{x∼p̂_data} ||x − r(x; w)||₂²,                                     (5.102)

while our model is defined to have w with norm one and reconstruction function r(x) = w^⊤x w.

In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization so long as we have some way of approximating its gradients.
Most machine learning algorithms make use of this recipe, though it may not immediately be obvious. If a machine learning algorithm seems especially unique or hand-designed, it can usually be understood as using a special-case optimizer. Some models such as decision trees or k-means require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers. Recognizing that most machine learning algorithms can be described using this recipe helps to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
5.11 Challenges Motivating Deep Learning

The simple machine learning algorithms described in this chapter work very well on a wide variety of important problems. However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects.

The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.
5.11.1 The Curse of Dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.

[Figure 5.9: as the number of relevant dimensions d of the data increases, the number of configurations of interest, O(v^d) for v distinct values along each dimension, may grow exponentially.]
The curse of dimensionality arises in many places in computer science, and especially so in machine learning.

One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in Fig. 5.9, a statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples. To understand the issue, let us consider that the input space is organized into a grid, like in the figure. In low dimensions we can describe this space with a low number of grid cells that are mostly occupied by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input. For example, if estimating the probability density at some point x, we can just return the number of training examples in the same unit volume cell as x, divided by the total number of training examples. If we wish to classify an example, we can return the most common class of training examples in the same cell. If we are doing regression, we can average the target values observed over the examples in that cell. But what about the cells for which we have seen no example? Because in high-dimensional spaces the number of configurations is going to be huge, much larger than our number of examples, most configurations will have no training example associated with them.

How could we possibly say something meaningful about these new configurations? Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point.
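The statistical problem is easy to demonstrate numerically. The sketch below (our own illustration; v = 10 distinct values per dimension and the sample sizes are arbitrary choices) builds the grid described above and reports the fraction of cells that contain no training example as the dimension d grows:

    import numpy as np

    def empty_cell_fraction(n_examples, d, v=10, seed=0):
        """Partition [0, 1)^d into v^d cells; return the fraction with no data."""
        rng = np.random.default_rng(seed)
        X = rng.random((n_examples, d))
        occupied = set(map(tuple, np.floor(X * v).astype(int)))
        return 1.0 - len(occupied) / float(v**d)

    for d in (1, 2, 3, 6):
        print(d, empty_cell_fraction(10_000, d))
    # With 10,000 examples the grid is essentially covered for d <= 3,
    # but for d = 6 at least 99% of the 10^6 cells are empty.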
5.11.2 Local Constancy and Smoothness Regularization

In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Previously, we have seen these priors incorporated as explicit beliefs in the form of probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as directly influencing the function itself and only indirectly acting on the parameters via their effect on the function. Additionally, we informally discuss prior beliefs as being expressed implicitly, by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even possible to express) in terms of a probability distribution representing our degree of belief in various functions.

Among the most widely used of these implicit “priors” is the smoothness prior or local constancy prior. This prior states that the function we learn should not change very much within a small region.

Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AI-level tasks. Throughout this book, we will describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks. Here, we explain why the smoothness prior alone is insufficient for these tasks.

There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All of these different methods are designed to encourage the learning process to learn a function f* that satisfies the condition

    f*(x) ≈ f*(x + ε)                                                          (5.103)

for most configurations x and small change ε. In other words, if we know a good answer for an input x (for example, if x is a labeled training example) then that answer is probably good in the neighborhood of x. If we have several good answers in some neighborhood we would combine them (by some form of averaging or interpolation) to produce an answer that agrees with as many of them as much as possible.
An extreme example of the local constancy approach is the k-nearest neighbors family of learning algorithms. These predictors are literally constant over each region containing all the points x that have the same set of k nearest neighbors in the training set. For k = 1, the number of distinguishable regions cannot be more than the number of training examples.

While the k-nearest neighbors algorithm copies the output from nearby training examples, most kernel machines interpolate between training set outputs associated with nearby training examples. An important class of kernels is the family of local kernels where k(u, v) is large when u = v and decreases as u and v grow farther apart from each other. A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example x resembles each training example x^(i). Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b).

Decision trees also suffer from the limitations of exclusively smoothness-based learning because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree. A multiple of n is needed to achieve some level of statistical confidence in the predicted output.

In general, to distinguish O(k) regions in input space, all of these methods require O(k) examples. Typically there are O(k) parameters, with O(1) parameters associated with each of the O(k) regions. The case of a nearest neighbor scenario, where each training example can be used to define at most one region, is illustrated in Fig. 5.10.

Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples? Clearly, assuming only smoothness of the underlying function will not allow a learner to do that. For example, imagine that the target function is a kind of checkerboard. A checkerboard contains many variations but there is a simple structure to them. Imagine what happens when the number of training examples is substantially smaller than the number of black and white squares on the checkerboard. Based on only local generalization and the smoothness or local constancy prior, we would be guaranteed to correctly guess the color of a new point if it lies within the same checkerboard square as a training example. There is no guarantee that the learner could correctly extend the checkerboard pattern to points lying in squares that do not contain training examples. With this prior alone, the only information that an example tells us is the color of its square, and the only way to get the colors of the entire checkerboard right is to cover each of its cells with at least one example.
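The checkerboard example is easy to simulate. In the sketch below (our own illustration; the board size, sample sizes, and the choice of 1-nearest neighbor as the locally constant predictor are all arbitrary), training examples cover only the bottom quarter of the board, and the predictor is accurate there but near chance everywhere else:

    import numpy as np

    rng = np.random.default_rng(0)

    def color(X, squares=8):
        """Color of the checkerboard square containing each point of [0, 1)^2."""
        return np.floor(X * squares).astype(int).sum(axis=1) % 2

    def predict_1nn(X_train, y_train, X_test):
        # Local constancy: copy the label of the nearest training example.
        d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
        return y_train[d.argmin(axis=1)]

    X_train = rng.random((200, 2)) * [1.0, 0.25]    # only the bottom two rows
    X_test = rng.random((2000, 2))                  # the whole board
    y_hat = predict_1nn(X_train, color(X_train), X_test)

    covered = X_test[:, 1] < 0.25
    print((y_hat[covered] == color(X_test)[covered]).mean())    # close to 1.0
    print((y_hat[~covered] == color(X_test)[~covered]).mean())  # about 0.5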
The smoothness assumption and the associated non-parametric learning algorithms work extremely well so long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned. This is generally true when the function to be learned is smooth enough and varies in few enough dimensions. In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples. If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?

The answer to both of these questions is yes. The key insight is that a very large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long as we introduce some dependencies between the regions via additional assumptions about the underlying data generating distribution. In this way, we can actually generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many different deep learning algorithms provide implicit or explicit assumptions that are reasonable for a broad range of AI tasks in order to capture these advantages.

Other approaches to machine learning often make stronger, task-specific assumptions. For example, we could easily solve the checkerboard task by providing the assumption that the target function is periodic. Usually we do not include such strong, task-specific assumptions into neural networks so that they can generalize to a much wider variety of structures. AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions.

The core idea in deep learning is that we assume that the data was generated by the composition of factors, or features, potentially at multiple levels in a hierarchy. Many other similarly generic assumptions can further improve deep learning algorithms. These apparently mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. These exponential gains are described more precisely in Sec. 6.4.1, Sec. 15.4, and Sec. 15.5. The exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality.
5.11.3 Manifold Learning

An important concept underlying many ideas in machine learning is that of a manifold.

A manifold is a connected region. Mathematically, it is a set of points, associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.

The definition of a neighborhood surrounding each point implies the existence of transformations that can be applied to move on the manifold from one position to a neighboring one. In the example of the world's surface as a manifold, one can walk north, south, east, or west.

Although there is a formal mathematical meaning to the term “manifold,” in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation. See Fig. 5.11 for an example of training data lying near a one-dimensional manifold embedded in two-dimensional space. In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.
CHAPTER 5. MACHINE LEARNING BASICS

Man
Many y mac
machine
hine learning problems seem hop hopeless
eless if we exp expect
ect the machine
learning algorithm to learn functions with interesting variations across all of
Rn. Man y macle
Manifold hine learning
learning
arning problems
algorithms surmounseem
surmount hopobstacle
t this eless if wbey exp ect thethat
assuming machine
most
learning
of n algorithm to learn functions with interesting variations across all of
Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting: the key assumption remains that probability mass is highly concentrated.

The assumption that the data lies along a low-dimensional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is at least approximately correct. The evidence in favor of this assumption consists of two categories of observations.

The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise essentially never resembles structured inputs from these domains. Fig. 5.12 shows how, instead, uniformly sampled points look like the patterns of static that appear on analog television sets when no signal is available. Similarly, if you generate a document by picking letters uniformly at random, what is the probability that you will get a meaningful English-language text? Almost zero, again, because most of the long sequences of letters do not correspond to a natural language sequence: the distribution of natural language sequences occupies a very small volume in the total space of sequences of letters.
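The random-letters argument can be made concrete with a toy experiment. The sketch below is ours, not the book's; the five-word stand-in lexicon is an illustrative assumption, so the estimated probability only conveys the order of magnitude of the effect.

    import random
    import string

    random.seed(0)
    # Stand-in lexicon of five-letter English words (an assumption).
    words = {"hello", "world", "money", "piano", "table"}

    trials = 100_000
    hits = sum(
        "".join(random.choices(string.ascii_lowercase, k=5)) in words
        for _ in range(trials)
    )
    print(hits / trials)  # ~0: uniform strings essentially never land in
                          # the tiny volume occupied by natural language

Even against a full dictionary, the fraction of random strings that form words shrinks rapidly with string length, which is the concentration the text describes.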
Of course, concentrated probability distributions are not sufficient to show that the data lies on a reasonably small number of manifolds. We must also establish that the examples we encounter are connected to each other by other examples, with each example surrounded by other highly similar examples that may be reached by applying transformations to traverse the manifold. The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, and so on. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.

These thought experiments convey some intuitive reasons supporting the manifold hypothesis. More rigorous experiments (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004) clearly support the hypothesis for a large class of datasets of interest in AI.

When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in R^n. In everyday life, we can think of roads as 1-D manifolds embedded in 3-D space. We give directions to specific addresses in terms of address numbers along these 1-D roads, not in terms of coordinates in 3-D space. Extracting these manifold coordinates is challenging, but holds the promise of improving many machine learning algorithms. This general principle is applied in many contexts. Fig. 5.13 shows the manifold structure of a dataset consisting of faces. By the end of this book, we will have developed the methods necessary to learn such a manifold structure. In Fig. 20.6, we will see how a machine learning algorithm can successfully accomplish this goal.

This concludes Part I, which has provided the basic concepts in mathematics and machine learning that are employed throughout the remaining parts of the book. You are now prepared to embark upon your study of deep learning.
Part II

Deep Networks: Modern Practices
This part of the book summarizes the state of modern deep learning as it is used to solve practical applications.

Deep learning has a long history and many aspirations. Several approaches have been proposed that have yet to entirely bear fruit. Several ambitious goals have yet to be realized. These less-developed branches of deep learning appear in the final part of the book.

This part focuses only on those approaches that are essentially working technologies that are already used heavily in industry.

Modern deep learning provides a very powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity. Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples. Other tasks, which cannot be described as associating one vector to another, or which are difficult enough that a person would require time to think and reflect in order to accomplish them, remain beyond the scope of deep learning for now.

This part of the book describes the core parametric function approximation technology that is behind nearly all modern practical applications of deep learning. We begin by describing the feedforward deep network model that is used to represent these functions. Next, we present advanced techniques for regularization and optimization of such models. Scaling these models to large inputs such as high-resolution images or long temporal sequences requires specialization. We introduce the convolutional network for scaling to large images and the recurrent neural network for processing temporal sequences. Finally, we present general guidelines for the practical methodology involved in designing, building, and configuring an application involving deep learning, and review some of the applications of deep learning.

These chapters are the most important for a practitioner—someone who wants to begin implementing and using deep learning algorithms to solve real-world problems today.

Chapter 6

Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented in Chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name "deep learning" arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.

Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representation is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. However, modern neural network research is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they may be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. Equivalently, we can apply the kernel trick described in Sec. 5.7.2 to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.

1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic—we do so by using a very broad family φ(x; θ). This approach can also capture the benefit of the second approach: human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function. (A minimal sketch of this learned-features model appears at the end of this introduction.)

This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all of the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic mappings from x to y that lack feedback connections. Other models presented later will apply these principles to learning stochastic mappings, learning functions with feedback, and learning probability distributions over a single vector.

We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks. Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values. We must also design the architecture of the network, including how many layers the network should contain, how these layers should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.
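To make strategy 3 above concrete before the worked example of the next section, here is a minimal NumPy sketch (ours, not the book's) of the learned-features model y = φ(x; θ)⊤w, where the feature map φ is a one-hidden-layer network. All names and sizes below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def phi(x, W, c):
        # Learned feature map phi(x; theta): an affine transformation
        # followed by a fixed element-wise nonlinearity (a rectifier).
        return np.maximum(0.0, x @ W + c)

    # theta = (W, c) parametrizes the representation; w maps phi(x) to y.
    W = rng.normal(size=(2, 4))   # 2 inputs -> 4 learned features (arbitrary)
    c = np.zeros(4)
    w = rng.normal(size=4)

    x = np.array([0.5, -1.0])
    y = phi(x, W, c) @ w          # the complete model f(x; theta, w)
    print(y)

Training would adjust both (W, c) and w, which is exactly what distinguishes this strategy from fixing φ by hand.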
6.1 Example: Learning XOR

To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.

In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, [1, 1]⊤}. We will train the network on all four of these points. The only challenge is to fit the training set.

We can treat this problem as a regression problem and use a mean squared error loss function. We choose this loss function to simplify the math for this example as much as possible. We will see later that there are other, more appropriate approaches for modeling binary data.

Evaluated on our whole training set, the MSE loss function is

    J(θ) = (1/4) ∑_{x∈X} (f*(x) − f(x; θ))².    (6.1)
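As a quick numerical check, the sketch below (ours, not the book's) evaluates Eq. 6.1 with NumPy. It also fits the linear model discussed in the next paragraphs by ordinary least squares, which reproduces the w = 0, b = 1/2 solution and the resulting loss of 0.25.

    import numpy as np

    # The four training points and their XOR targets.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 1., 1., 0.])

    def mse(y_hat):
        # Eq. 6.1: mean squared error over the whole training set.
        return np.mean((y - y_hat) ** 2)

    # Fit f(x; w, b) = x'w + b by least squares; appending a column of
    # ones lets b be estimated along with w.
    A = np.hstack([X, np.ones((4, 1))])
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    w, b = params[:2], params[2]

    print(w, b)             # w is [0, 0], b is 0.5
    print(mse(A @ params))  # 0.25: the model outputs 0.5 everywhere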
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

    f(x; w, b) = x⊤w + b.    (6.2)

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Fig. 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units. See Fig. 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together: h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).

What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) be linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = w⊤W⊤x. We could represent this function as f(x) = x⊤w′ where w′ = Ww.

Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed, nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar.
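Before moving on, the claim that a chain of purely linear layers collapses into a single linear model is easy to verify numerically. A minimal sketch (ours, not the book's), with arbitrary hypothetical shapes:

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(2, 3))   # first "layer" (hypothetical shapes)
    w = rng.normal(size=3)        # second "layer"

    x = rng.normal(size=2)
    two_layer = (x @ W) @ w       # f(x) = w'W'x, two chained linear maps
    one_layer = x @ (W @ w)       # the same function as one weight vector
    print(np.allclose(two_layer, one_layer))  # True: still a linear model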
[Figure 6.1 graphic: two panels, "Original x space" (axes x1, x2) and "Learned h space" (axes h1, h2); see the caption below.]
Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous, but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.
Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in Fig. 6.3.

We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can now specify a solution to the XOR problem. Let

    W = [1 1
         1 1],    (6.4)

    c = [ 0
         −1],    (6.5)

    w = [ 1
         −2],    (6.6)
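A one-line NumPy rendering of this activation function (our sketch, not the book's code):

    import numpy as np

    def relu(z):
        # g(z) = max{0, z}, applied element-wise.
        return np.maximum(0.0, z)

    print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0. 0. 0. 1.5 3. ]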
Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
and b = 0.

We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

    X = [0 0
         0 1
         1 0
         1 1].    (6.7)

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

    XW = [0 0
          1 1
          1 1
          2 2].    (6.8)

Next, we add the bias vector c, to obtain

    [0 −1
     1  0
     1  0
     2  1].    (6.9)

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [0 0
     1 0
     1 0
     2 1].    (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in Fig. 6.1, they now lie in a space where a linear model can solve the problem.

We finish by multiplying by the weight vector w:

    [0
     1
     1
     0].    (6.11)
The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
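Both the walkthrough and the remarks on gradient-based search can be reproduced numerically. The following NumPy sketch is ours, not the book's: it first verifies the hand-specified solution, then runs plain full-batch gradient descent with hand-derived gradients from a small random initialization. Whether the search reaches zero error depends on that initialization, exactly as the text cautions.

    import numpy as np

    # Forward pass of Eq. 6.3 with the parameters of Eqs. 6.4-6.6 and
    # b = 0, reproducing the batch walkthrough of Eqs. 6.7-6.11.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 1., 1., 0.])

    W = np.array([[1., 1.], [1., 1.]])
    c = np.array([0., -1.])
    w = np.array([1., -2.])
    b = 0.0

    h = np.maximum(0.0, X @ W + c)       # Eqs. 6.8-6.10
    print(h @ w + b)                     # Eq. 6.11: [0. 1. 1. 0.]

    # A gradient-based search for such parameters: full-batch gradient
    # descent on the MSE of Eq. 6.1.
    rng = np.random.default_rng(3)
    W = rng.normal(scale=0.5, size=(2, 2))   # small random initial weights
    c = np.zeros(2)
    w = rng.normal(scale=0.5, size=2)
    b = 0.0
    lr = 0.1

    for step in range(5000):
        a = X @ W + c                    # pre-activations
        h = np.maximum(0.0, a)           # hidden layer
        y_hat = h @ w + b
        g = 2.0 * (y_hat - y) / len(y)   # dJ/dy_hat for the MSE loss
        grad_w = h.T @ g
        grad_b = g.sum()
        g_h = np.outer(g, w) * (a > 0.0)  # back through the rectifier
        grad_W = X.T @ g_h
        grad_c = g_h.sum(axis=0)
        W -= lr * grad_W
        c -= lr * grad_c
        w -= lr * grad_w
        b -= lr * grad_b

    h = np.maximum(0.0, X @ W + c)
    print(np.mean((h @ w + b - y) ** 2))  # often near zero, but the result
    # depends on the initialization: this loss is non-convex (see Sec. 6.2)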
6.2 Gradien
Gradient-Based
t-Based Learning

6.2 Gradien
Designing and training t-Based
a neuralLearning
netw
network
ork is not much differen differentt from training any
other machine learning mo model
del with gradient descen descent. t. In Sec. 5.10, we describ
described ed
Designing
ho
how and training a neural netw
w to build a machine learning algorithm by sp ork is not m uch
specifying differen t from
ecifying an optimization protraining any
procedure,
cedure,
aother
cost machine
function,learning
and a mo mo
model delfamily
del with. gradient descent. In Sec. 5.10, we describ ed
family.
how to build a machine learning algorithm by sp ecifying an optimization pro cedure,
Thefunction,
a cost largest difference
and a mobdel etw
etween
een the
family . linear mo models
dels we havhavee seen so far and neural
net
networks
works is that the nonlinearit
nonlinearity y of a neural netw networkork causes most interesting loss
The largest difference
functions to become non-conv b etw
non-convex. een
ex. This means thatwe
the linear mo dels have seen
neural net so farare
networks
works andusually
neural
networksbyis using
trained that the nonlinearit
iterativ
iterative, y of a neuraloptimizers
e, gradient-based network causes most interesting
that merely drive the cost loss
functionstotoa become
function very lo wnon-conv
low ex. This
value, rather thanmeans
the linear thatequation
neural net worksused
solvers are tousually
train
trained by using
linear regression mo iterativ
models e, gradient-based
dels or the conconvex optimizers that merely
vex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory—in practice it is very robust but can encounter numerical problems). Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models will be described in detail in Chapter 8, with parameter initialization in particular discussed in Sec. 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another. The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in Sec. 4.3, and, more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in Sec. 5.9.

We can, of course, train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly. Sec. 6.5 will describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. We now revisit these design considerations with special emphasis on the neural networks scenario.
6.2.1 Cost Functions
An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in Sec. 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks will be described in Chapter 7.
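As a purely illustrative sketch of this combination (mine, not the book's; lam is a hypothetical weight decay coefficient):

    import numpy as np

    def total_cost(primary_cost, weights, lam=1e-4):
        # J_total = J_primary + lam * ||w||^2 (weight decay, cf. Sec. 5.2.2)
        return primary_cost + lam * np.sum(weights ** 2)

    w = np.array([0.5, -1.0, 2.0])
    print(total_cost(1.25, w, lam=1e-2))   # 1.25 + 0.01 * 5.25 = 1.3025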
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x).    (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in Sec. 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \lVert y - f(x; \theta) \rVert^2 + \text{const},    (6.13)

up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.

An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function -log p(y | x).

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Many output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit in Sec. 6.2.2.

One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in Chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.
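As a concrete check of Eqs. 6.12 and 6.13, and of the unbounded-reward behavior just described, here is a small NumPy sketch (my illustration, not code from the book; all values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3
    y = rng.normal(size=d)          # an observed target
    f = rng.normal(size=d)          # the predicted mean f(x; theta)

    # -log N(y; f, I) splits into half the squared error plus a constant
    # that is independent of theta, exactly as in Eq. 6.13.
    nll = 0.5 * np.sum((y - f) ** 2) + 0.5 * d * np.log(2 * np.pi)
    print(nll, 0.5 * np.sum((y - f) ** 2))

    # If the model can also shrink a variance parameter, the NLL of a
    # perfectly predicted output has no minimum: it falls without bound.
    for sigma in [1.0, 1e-2, 1e-4, 1e-8]:
        print(sigma, 0.5 * d * np.log(2 * np.pi * sigma ** 2))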
6.2.1.2 Learning Conditional Statistics
Instead of learning a full probability distribution p(y | x; θ) we often want to learn just one conditional statistic of y given x.

For example, we may have a predictor f(x; θ) that we wish to use to predict the mean of y.

If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x. Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in Sec. 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.

Our first result derived using calculus of variations is that solving the optimization problem

    f^* = \arg\min_f \mathbb{E}_{x, y \sim p_{\text{data}}} \lVert y - f(x) \rVert^2    (6.14)

yields

    f^*(x) = \mathbb{E}_{y \sim p_{\text{data}}(y \mid x)}[y],    (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x.
Different cost functions give different statistics. A second result derived using calculus of variations is that

    f^* = \arg\min_f \mathbb{E}_{x, y \sim p_{\text{data}}} \lVert y - f(x) \rVert_1    (6.16)

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
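The following quick empirical check (my sketch, holding x fixed so the predictor reduces to a constant c) recovers both results: the minimizer of the squared error is the mean, and the minimizer of the absolute error is the median:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.exponential(size=10_000)         # skewed, so mean != median

    cs = np.linspace(0.0, 2.0, 401)
    mse = np.array([np.mean((y - c) ** 2) for c in cs])
    mae = np.array([np.mean(np.abs(y - c)) for c in cs])

    print(cs[np.argmin(mse)], y.mean())      # both near 1.0 (Eq. 6.15)
    print(cs[np.argmin(mae)], np.median(y))  # both near log 2 ~ 0.69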
6.2.2 Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well. We revisit these units with additional detail about their use as hidden units in Sec. 6.3.

Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
6.2.2.1 Linear Units for Gaussian Output Distributions
One simple kind of output unit is an output unit based on an affine transformation with no nonlinearity. These are often just called linear units.

Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

    p(y \mid x) = \mathcal{N}(y; \hat{y}, I).    (6.17)

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in Sec. 6.2.2.4.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.
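A minimal sketch of such a layer (my illustration; the shapes and initialization scale are arbitrary choices), including the mean squared error gradient, which passes through no saturating nonlinearity:

    import numpy as np

    rng = np.random.default_rng(0)
    h = rng.normal(size=(32, 8))        # hidden features for a batch of 32
    W = 0.01 * rng.normal(size=(8, 3))  # small random initial weights
    b = np.zeros(3)                     # biases initialized to zero
    y = rng.normal(size=(32, 3))        # training targets

    y_hat = h @ W + b                   # linear units: the Gaussian mean of Eq. 6.17

    # Gradient of the mean squared error (equivalently, of the Gaussian
    # negative log-likelihood) with respect to the output layer parameters.
    err = y_hat - y
    grad_W = h.T @ err / len(h)
    grad_b = err.mean(axis=0)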
6.2.2.2 Sigmoid Units for Bernoulli Output Distributions
Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

    P(y = 1 \mid x) = \max\left\{0, \min\left\{1, w^\top h + b\right\}\right\}.    (6.18)

This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that wᵀh + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.

Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood.

A sigmoid output unit is defined by

    \hat{y} = \sigma\left(w^\top h + b\right),    (6.19)

where σ is the logistic sigmoid function described in Sec. 3.10.

We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = wᵀh + b. Next, it uses the sigmoid activation function to convert z into a probability.

We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

    \log \tilde{P}(y) = yz,    (6.20)
    \tilde{P}(y) = \exp(yz),    (6.21)
    P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1} \exp(y' z)},    (6.22)
    P(y) = \sigma((2y - 1)z).    (6.23)

Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.

This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is -log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

    J(\theta) = -\log P(y \mid x)    (6.24)
              = -\log \sigma((2y - 1)z)    (6.25)
              = \zeta((1 - 2y)z).    (6.26)

This derivation makes use of some properties from Sec. 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 - 2y)z is very negative. Saturation thus occurs only when the model already has the right answer—when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 - 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act to quickly correct a mistaken z.

When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.

Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.
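The following sketch (my rendering of the implementation advice above, with deliberately extreme example logits) computes Eq. 6.26 directly from z using a numerically stable softplus and contrasts it with the naive -log σ(z):

    import numpy as np

    def softplus(x):
        # zeta(x) = log(1 + exp(x)), computed without overflow for large |x|
        return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

    def bernoulli_nll(y, z):
        # Eq. 6.26: J(theta) = zeta((1 - 2y) z), written in terms of z
        return softplus((1.0 - 2.0 * y) * z)

    z = np.array([-800.0, -2.0, 0.0, 2.0, 800.0])   # deliberately extreme logits
    y = np.ones_like(z)                             # suppose the true label is 1
    print(bernoulli_nll(y, z))    # finite everywhere; ~|z| when badly wrong

    # The naive form -log(sigma(z)) fails: exp(800) overflows, sigma(-800)
    # underflows to zero, and the logarithm then returns infinity.
    with np.errstate(over="ignore", divide="ignore"):
        print(-np.log(1.0 / (1.0 + np.exp(-z))))    # inf at z = -800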
6.2.2.3 Softmax Units for Multinoulli Output Distributions
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.

In the case of binary variables, we wished to produce a single number

    \hat{y} = P(y = 1 \mid x).    (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = log P̃(y = 1 | x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element of ŷ_i be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

    z = W^\top h + b,    (6.28)

where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

    \text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.    (6.29)

As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

    \log \text{softmax}(z)_i = z_i - \log \sum_j \exp(z_j).    (6.30)

The first term of Eq. 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of Eq. 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, log Σ_j exp(z_j), observe that this term can be roughly approximated by max_j z_j. This approximation is based on the idea that exp(z_k) is insignificant for any z_k that is noticeably less than max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the -z_i term and the log Σ_j exp(z_j) ≈ max_j z_j = z_i terms will roughly cancel. This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.

So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict
the fraction of counts of each outcome observed in the training set:

    \text{softmax}(z(x; \theta))_i \approx \frac{\sum_{j=1}^{m} \mathbf{1}_{y^{(j)} = i,\, x^{(j)} = x}}{\sum_{j=1}^{m} \mathbf{1}_{x^{(j)} = x}}.    (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen so long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.

Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units, and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990). To understand why these other loss functions can fail, we need to examine the softmax function itself.

Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. In the case of the softmax, there are multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activation function.

To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar to all of its inputs:

    \text{softmax}(z) = \text{softmax}(z + c).    (6.32)

Using this property, we can derive a numerically stable variant of the softmax:

    \text{softmax}(z) = \text{softmax}\left(z - \max_i z_i\right).    (6.33)

The reformulated version allows us to evaluate softmax with only small numerical errors even when z contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from max_i z_i.

An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_i z_i) and z_i is much greater than all of the other inputs. The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate, and can cause similar difficulties for learning if the loss function is not designed to compensate for it.

The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = Wᵀh + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n - 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n - 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_1 = 0. Both the n - 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions, but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.

From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex. At the extreme (when the difference between the maximal a_i and the others is large in magnitude) it becomes a form of winner-take-all (one of the outputs is nearly 1 and the others are nearly 0).

The name "softmax" can be somewhat confusing. The function is more closely related to the argmax function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The argmax function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a "softened" version of the argmax. The corresponding soft version of the maximum function is softmax(z)ᵀz. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.
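A direct NumPy rendering of Eqs. 6.30 and 6.33, together with a small demonstration of the "softargmax" naming point (my sketch; the input values are chosen only to trigger overflow in the naive formula):

    import numpy as np

    def softmax_stable(z):
        # Eq. 6.33: shifting by max(z) leaves the output unchanged (Eq. 6.32)
        # but keeps every exp argument <= 0, so nothing overflows.
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    def log_softmax(z):
        # Eq. 6.30: z_i - log sum_j exp(z_j), with the log-sum-exp also
        # evaluated through the max-shift trick.
        m = np.max(z)
        return z - (m + np.log(np.sum(np.exp(z - m))))

    z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow
    print(softmax_stable(z))                # approx [0.09, 0.24, 0.67]
    print(log_softmax(z))                   # finite log probabilities

    # The output vector smooths a one-hot argmax indicator, while
    # softmax(z)^T z is the corresponding soft version of max(z).
    z2 = np.array([0.3, 2.0, 9.0, 8.5])
    p = softmax_stable(z2)
    print(p)                                # ~[0.00, 0.00, 0.62, 0.38]
    print(p @ z2)                           # ~8.8, close to max(z2) = 9.0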
6.2.2.4 Other Output Types
The linear, sigmoid, and softmax output units described above are the most common. Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihood provides a guide for how to design a good cost function for nearly any kind of output layer.

In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use -log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as -log p(y; ω(x)).

For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value. A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood -log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; θ). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in Eq. 3.22. In the multivariate case it is most common to use a diagonal precision matrix

    \text{diag}(\beta).    (6.34)

This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition, and logarithm operations is well-behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in terms of standard deviation, the log-likelihood would still involve division, and would also involve squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared.
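A sketch of this precision-based parametrization (my own illustration with made-up shapes; the softplus positivity trick anticipates the next paragraph):

    import numpy as np

    def softplus(x):
        # zeta(x): maps a raw activation to a strictly positive value
        return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

    def gaussian_nll_precision(y, mu, beta):
        # -log N(y; mu, diag(beta)^-1): only multiplication by beta_i and
        # addition of log beta_i appear, so the gradient is well-behaved.
        return 0.5 * np.sum(beta * (y - mu) ** 2 - np.log(beta)
                            + np.log(2 * np.pi), axis=-1)

    rng = np.random.default_rng(0)
    y = rng.normal(size=(4, 3))          # targets
    mu = rng.normal(size=(4, 3))         # predicted means, part of omega
    a = rng.normal(size=(4, 3))          # raw network activations
    beta = softplus(a)                   # positive diagonal precision (Eq. 6.34)
    print(gaussian_nll_precision(y, mu, beta))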
Regardless of whether we use standard deviation, variance, or precision, we must


ensure that the cov covariance
ariance matrix of the Gaussian is positive definite. Because
Regardless
the eigenv alues of the we
of
eigenvalues whether use standard
precision matrixdeviation,
are the recipro variance,
reciprocals cals orof precision,
the eigenv we
eigenvalues must
alues of
ensure
the cov that
covariance the cov ariance
ariance matrix, this is equivmatrix of
equivalen
alenthe Gaussian is positive definite.
alentt to ensuring that the precision matrix is Because
the eigenv
p ositiv alues of the precision matrix
ositivee definite. If we use a diagonal matrix, are the or arecipro
scalar timescals ofthe thediagonal
eigenvalues
matrix,of
the cov
then theariance matrix, this
only condition is equiv
we need alent toonensuring
to enforce the output thatofthetheprecision
mo del is pmatrix
model ositivityis.
ositivity.
p ositiv
If e definite.
we supp
supposeose that If we
a isusethe
a diagonal
ra
raww activ matrix,
activation
ation of or the
a scalar
mo
model times
del usedthetodiagonal
determine matrix,
the
then the only condition we need to enforce on the output
diagonal precision, we can use the softplus function to obtain a p ositiv of the mo ositivee precision.
del is p ositivity
If w e supp ose that a is the ra w activ ation of the
vector: β = ζ( a). This same strategy applies equally if using variance mo del used to determine
or standardthe
diagonal precision,
deviation rather than we can use theorsoftplus
precision if usingfunction
a scalartotimes obtainidena ptity
ositiv
identity e precision
rather than
vdiagonal
ector: β matrix.
= ζ( a). This same strategy applies equally if using variance or standard
deviation rather than precision or if using a scalar times identity rather than
It is rare to learn a cov covariance
ariance or precision matrix with richer structure than
diagonal matrix.
diagonal. If the cov covariance
ariance is full and conditional, then a parametrization must
It is rare
b e chosen thattoguaran
learn atees
guarantees covpariance
ositiv or precision of
ositive-definiteness
e-definiteness matrix with richer
the predicted cov structure
covariance than
ariance matrix.
diagonal.
This can b eIfachiev
the cov
achieved ed ariance
by writingis full
Σ(xand) = conditional,
B (x)B> (x) ,then where aBparametrization
is an unconstrained must
b e chosen
square that guaran
matrix. tees p ositiv
One practical issuee-definiteness
if the matrixofisthe fullpredicted
rank is that covariance
computing matrix.
the
This
lik can
likeliho
eliho
elihoo b e achiev
o d is exp ed
expensiv
ensiv by
ensive, writing Σ ( x ) = B ( x )B ( x ) , where 3 B
e, with a d × d matrix requiring O(d ) computation for the is an unconstrained
square matrix.
determinan
determinant t andOne invpractical
erse of Σissue
inverse if the
( x) (or matrix
equiv
equivalently
alently
alently,is, full
andrank moreiscommonly
that computing done, the
its
likeliho o
eigendecomp d is
eigendecompositionexp ensiv e, with a
osition or that of B (x)).d d matrix requiring O ( d ) computation for the
determinant and inverse of Σ( x)×(or equivalently, and more commonly done, its
We often osition
eigendecomp want toorp erform
that ofmultimodal
B (x)). regression, that is, to predict real values
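Both positivity constructions are one-liners. A minimal sketch, with made-up activation values and matrix, using NumPy only:

    import numpy as np

    def softplus(a):
        # zeta(a) = log(1 + exp(a)), computed stably; output is always positive.
        return np.logaddexp(0.0, a)

    # Diagonal precision from a raw activation vector a: beta = zeta(a) > 0,
    # so diag(beta) is positive definite by construction.
    a = np.array([-2.0, 0.0, 3.0])          # hypothetical raw activations
    beta = softplus(a)

    # Full covariance from an unconstrained square matrix B (random here for
    # illustration): Sigma = B B^T is symmetric positive semi-definite, and
    # positive definite whenever B has full rank.
    rng = np.random.default_rng(0)
    B = rng.standard_normal((3, 3))
    Sigma = B @ B.T
    print(beta, np.linalg.eigvalsh(Sigma))  # eigenvalues are non-negative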
We often want to perform multimodal regression, that is, to predict real values that come from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994). Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution

    p(y | x) = ∑_{i=1}^{n} p(c = i | x) N(y; µ^(i)(x), Σ^(i)(x)).    (6.35)

The neural network must have three outputs: a vector defining p(c = i | x), a matrix providing µ^(i)(x) for all i, and a tensor providing Σ^(i)(x) for all i. These outputs must satisfy different constraints (a code sketch below illustrates all three):

1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable c,¹ and can typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.

2. Means µ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component, and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n × d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the negative log-likelihood naturally weights each example's contribution to the loss for each component by the probability that the component produced the example.

3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.

¹ We consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and make that unobserved choice a random variable.

It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see Sec. 10.11.1) while another is to scale the gradients heuristically (Murray and Larochelle, 2014).
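To make the three constraints above concrete, here is a minimal sketch of the negative log-likelihood of a diagonal-covariance Gaussian mixture output. It is my own illustration, not the book's code; the array shapes, the log-variance parametrization, and the use of SciPy's log-sum-exp for stability are all assumptions.

    import numpy as np
    from scipy.special import logsumexp   # stable log of a sum of exponentials

    def mdn_nll(logits, mu, log_var, y):
        # logits:  (n,)   raw scores; softmax of these gives p(c = i | x)
        # mu:      (n, d) mean of each of the n components
        # log_var: (n, d) log of each diagonal variance (keeps variances positive)
        # y:       (d,)   observed target
        log_prior = logits - logsumexp(logits)   # log p(c = i | x)
        var = np.exp(log_var)
        # log N(y; mu_i, diag(var_i)) for every component i
        log_gauss = -0.5 * np.sum(
            np.log(2 * np.pi) + log_var + (y - mu) ** 2 / var, axis=1)
        # log-sum-exp over components implements the mixture sum of Eq. 6.35;
        # its gradient yields the soft responsibility weighting described above.
        return -logsumexp(log_prior + log_gauss)

    rng = np.random.default_rng(0)   # hypothetical outputs: n = 3, d = 2
    print(mdn_nll(rng.standard_normal(3), rng.standard_normal((3, 2)),
                  rng.standard_normal((3, 2)), np.array([0.5, -1.0])))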
Gaussian mixture outputs are particularly effective in generative models of speech (Schuster, 1999) or movements of physical objects (Graves, 2013). The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains. An example of a mixture density network is shown in Fig. 6.4.
[Figure 6.4 (image omitted): Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All of these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.]
In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, we may wish for our neural network to output a sequence of characters that forms a sentence. In these cases, we may continue to use the principle of maximum likelihood applied to our model p(y; ω(x)), but the model we use to describe y becomes complex enough to be beyond the scope of this chapter. Chapter 10 describes how to use recurrent neural networks to define such models over sequences, and Part III describes advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units

So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model.

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice).
We describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can be used to suggest when to try out each of these units. It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.

Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly, as shown in Fig. 4.3. These ideas will be described further in Chapter 8. Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient. Hidden units that are not differentiable are usually non-differentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0 and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value ε that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that in practice one can safely disregard the non-differentiability of the hidden unit activation functions described below.

Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W⊤x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
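To make the shared pattern concrete, here is a minimal sketch (my own names and values, not the book's code) of a hidden layer, together with the one-sided derivative convention for the rectifier:

    import numpy as np

    def hidden_layer(x, W, b, g):
        # The pattern shared by most hidden units: affine transform, then an
        # element-wise nonlinearity g.
        z = W.T @ x + b
        return g(z)

    def relu(z):
        return np.maximum(0.0, z)

    def relu_grad(z):
        # At z = 0 the derivative is undefined; like most software, we silently
        # return one of the one-sided derivatives (here the left one, 0).
        return (z > 0).astype(z.dtype)

    x = np.array([1.0, -2.0])
    W = np.ones((2, 3))        # 2 inputs, 3 hidden units (illustrative values)
    b = np.zeros(3)
    print(hidden_layer(x, W, b, relu))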

6.3.1 Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z) = max{0, z}.

Rectified linear units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

    h = g(W⊤x + b).    (6.36)
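A short sketch (illustrative sizes and values, my own) instantiating Eq. 6.36, using the small positive bias initialization recommended in the next paragraph:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 4, 8                       # illustrative layer sizes
    W = rng.standard_normal((n_in, n_out)) * 0.1
    b = np.full(n_out, 0.1)                  # small positive bias, so most
                                             # units start out active

    def relu_layer(x):
        return np.maximum(0.0, W.T @ x + b)  # h = g(W^T x + b), Eq. 6.36

    x = rng.standard_normal(n_in)
    h = relu_layer(x)
    print((h > 0).mean())                    # fraction of initially active units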

When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small, positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.

Several generalizations of rectified linear units exist. Most of these generalizations perform comparably to rectified linear units and occasionally perform better.

One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere.

Three generalizations of rectified linear units are based on using a non-zero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes α_i to a small value like 0.01 while a parametric ReLU or PReLU treats α_i as a learnable parameter (He et al., 2015).

Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:

    g(z)_i = max_{j ∈ G^(i)} z_j    (6.37)

where G^(i) is the set of indices of the inputs for group i, {(i − 1)k + 1, . . . , ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.

A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, absolute value rectification function, or the leaky or parametric ReLU, or can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.

Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.

Because each unit is driven by multiple filters, maxout units have some redundancy that helps them to resist a phenomenon called catastrophic forgetting, in which neural networks forget how to perform tasks that they were trained on in the past (Goodfellow et al., 2014a).

Rectified linear units and all of these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear. This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep linear networks. Recurrent networks can learn from sequences and produce a sequence of states and outputs. When training them, one needs to propagate information through several time steps, which is much easier when some linear computations (with some directional derivatives being of magnitude near 1) are involved. One of the best-performing recurrent network architectures, the LSTM, propagates information through time via summation, a particularly straightforward kind of such linear activation. This is discussed further in Sec. 10.10.
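The rectifier generalizations described in this section reduce to a few lines of code. The sketch below is my own notation (it also assumes the length of z is a multiple of k); it implements the α-parametrized family and a maxout unit:

    import numpy as np

    def alpha_rectifier(z, alpha):
        # h_i = max(0, z_i) + alpha_i * min(0, z_i)
        # alpha = -1   -> absolute value rectification, g(z) = |z|
        # alpha = 0.01 -> leaky ReLU; a learned alpha gives the PReLU
        return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

    def maxout(z, k):
        # Divide z into groups of k values and return each group's maximum
        # (Eq. 6.37); len(z) must be a multiple of k.
        return z.reshape(-1, k).max(axis=1)

    z = np.array([-2.0, -0.5, 1.0, 3.0])
    print(alpha_rectifier(z, -1.0))    # [2.  0.5  1.  3.]
    print(alpha_rectifier(z, 0.01))    # [-0.02  -0.005  1.  3.]
    print(maxout(z, 2))                # [-0.5  3.]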

6.3.2 Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

    g(z) = σ(z)    (6.38)

or the hyperbolic tangent activation function

    g(z) = tanh(z).    (6.39)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.

We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain: they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.

When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to the identity function near 0, training a deep neural network ŷ = w⊤ tanh(U⊤ tanh(V⊤x)) resembles training a linear model ŷ = w⊤U⊤V⊤x so long as the activations of the network can be kept small. This makes training the tanh network easier.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
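A quick numerical check (illustrative only, not from the text) of the identity tanh(z) = 2σ(2z) − 1 and of the saturation just described:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-6, 6, 5)
    # The two classical sigmoidal activations differ only by scaling/shifting.
    print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True

    # Saturation: sigma'(z) = sigma(z)(1 - sigma(z)) is near zero for large
    # |z|, which is what makes gradient-based learning difficult.
    s = sigmoid(z)
    print(s * (1 - s))   # tiny at the ends, 0.25 at z = 0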

6.3.3 Other Hidden Units

Many other types of hidden units are possible, but are used less frequently.

In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, the authors tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.

It would be impractical to list all of the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W⊤x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer based on W. The factored approach is to compute h = g(V⊤U⊤x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low-rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.

Softmax units are another kind of unit that is usually used as an output (as described in Sec. 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch. These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, described in Sec. 10.12.
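The parameter saving from the low-rank factorization described above is easy to verify. A sketch with made-up sizes:

    import numpy as np

    n, p, q = 1000, 1000, 50     # illustrative layer sizes with a small rank q
    print(n * p)                 # W alone:       1,000,000 parameters
    print((n + p) * q)           # U and V total:   100,000 parameters

    rng = np.random.default_rng(0)
    U = rng.standard_normal((n, q))   # first layer: no activation function
    V = rng.standard_normal((q, p))   # second layer

    # h = g(V^T U^T x), bias omitted for brevity, with g = relu here;
    # the composed linear map V^T U^T has rank at most q.
    x = rng.standard_normal(n)
    h = np.maximum(0.0, V.T @ (U.T @ x))
    print(h.shape)               # (1000,)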

A few other reasonably common hidden unit types include (each is sketched in code below):

• Radial basis function or RBF unit: h_i = exp(−(1/σ_i²) ‖W_{:,i} − x‖²). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

• Softplus: g(a) = ζ(a) = log(1 + e^a). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

• Hard tanh: this is shaped similarly to the tanh and the rectifier, but unlike the latter it is bounded: g(a) = max(−1, min(1, a)). It was introduced by Collobert (2004).

Hidden unit design remains an active area of research and many useful hidden unit types remain to be discovered.
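Rough sketches of the three units listed above (my own code; the template and σ values are placeholders):

    import numpy as np

    def rbf_unit(x, w, sigma):
        # Radial basis function: most active when x is near the template w,
        # saturating to 0 almost everywhere else.
        return np.exp(-np.sum((w - x) ** 2) / sigma ** 2)

    def softplus(a):
        # zeta(a) = log(1 + exp(a)), a smooth version of the rectifier.
        return np.logaddexp(0.0, a)

    def hard_tanh(a):
        # Shaped like tanh and the rectifier, but bounded in [-1, 1].
        return np.clip(a, -1.0, 1.0)

    x = np.array([0.5, -0.5])
    print(rbf_unit(x, np.zeros(2), sigma=1.0))
    print(softplus(np.array([-3.0, 0.0, 3.0])))
    print(hard_tanh(np.array([-2.0, 0.3, 2.0])))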
6.4 Architecture Design

Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

    h^(1) = g^(1)(W^(1)⊤ x + b^(1)),    (6.40)

the second layer is given by

    h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2)),    (6.41)

and so on.
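A minimal sketch of this chain structure (illustrative sizes, random parameters, and tanh chosen as each g^(l)), computing Eqs. 6.40 and 6.41:

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [4, 8, 8]                 # input width, then the width of each layer
    params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(x):
        h = x
        for W, b in params:
            # Each layer is a function of the one that preceded it:
            # h^(l) = g^(l)(W^(l)T h^(l-1) + b^(l))
            h = np.tanh(W.T @ h + b)
        return h

    print(forward(rng.standard_normal(4)).shape)   # (8,)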

In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks often are able to use far fewer units per layer and far fewer parameters and often generalize better to the test set, but are also often harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.
A linear mo model,
del, mapping from features to outputs via matrix multiplication, can
by definition represen
representt only linear functions. It has the adv advantage
antage of b eing easy to
A linear mo del, mapping from features
train b ecause many loss functions result in conv to outputs
convex via matrix
ex optimization multiplication,
problems when can
by definition
applied represen
to linear mo
models.t only
dels. linear functions.
Unfortunately
Unfortunately, , we oftenIt has wanthe
want t toadv antage
learn of b eing
nonlinear easy to
functions.
train b ecause many loss functions result in convex optimization problems when
At first glance, we migh mightt presume that learning a nonlinear function requires
applied to linear mo dels. Unfortunately, we often want to learn nonlinear functions.
designing a sp specialized
ecialized mo model
del family for the kind of nonlinearity we wan wantt to learn.
A t first
Fortunately glance, we
ortunately,, feedforward netw migh t presume
networks that
orks with hidden la learning
layers a nonlinear
yers pro
provide
vide a univ function
ersal requires
universal appro
approxi-xi-
designing
mation framew a sp
framework.ecialized
ork. Sp mo del
Specifically
ecifically family for the
ecifically,, the universal appr kind of nonlinearity
approximation
oximation the theoror we
orem wan t to learn.
em (Hornik et al.,
F ortunately
1989; Cyb
Cybenko , feedforward netw orks with
enko, 1989) states that a feedforward netw hidden layers
networkork with a linearersal
pro vide a univ output approla xi-
layer
yer
mation
and framew
at least oneork. Sp ecifically
hidden lay er ,with
layer the universal
any “squashing”approximationactiv theorfunction
activation
ation em (Hornik (suc
(sucheth al.
as,
1989 ; Cyb enko
the logistic , 1989)activ
sigmoid states
activation thatfunction)
ation a feedforward network withany
can approximate a linear
Boreloutput
measurable layer
and at least
function fromone onehidden layer with any
finite-dimensional space “squashing”
to another activ ation
with an
any yfunction (such as
desired non-zero
the
amoun logistic sigmoid activ ation
amountt of error, provided that the net function)
netw can approximate any Borel
work is given enough hidden units. The measurable
function
deriv
derivatives
atives from
of one
the finite-dimensional
feedforward netw
network
ork space
can to another
also appro
approximatewith an
ximate they deriv
desired
derivatives
ativesnon-zero
of the
amoun t of error, provided that the net w ork is given enough
function arbitrarily well (Hornik et al., 1990). The concept of Borel measurability hidden units. The
deriv
is atives the
b eyond of the feedforward
scop
scope e of this bnetwo ok;ork
forcan ouralso approximate
purposes it suffices the deriv atives
to say thatof any
the
function
con arbitrarily well ( Hornik et al. , 1990 ). The concept
tinuous function on a closed and b ounded subset of Rn is Borel measurable
continuous of Borel measurability
is
and eyond
b thereforethema scop
y bee of this b o ok; for
approximated by our purposes
a neural net it suffices to say that any
may network.
work. RA neural netw networkork may
continuous
also appro
approximatefunction
ximate an
any yon a closed
function and b ounded
mapping from an any subset
y finite of is Boreldiscrete
dimensional measurablespace
and therefore ma y b e approximated b y a neural net
to another. While the original theorems were first stated in terms of units withwork. A neural netw ork may
also
activ appro
activation ximate an y function mapping
ation functions that saturate b oth for very negativfrom an y finite dimensional discrete
negativee and for very p ositive space
to another.
argumen
arguments, While the
ts, universal appro original
approximationtheorems
ximation theorems hav w ere first
havee stated in terms
also b een prov
proven of for
en units with
a wider
activation
class of activfunctions
activation that saturate
ation functions, whic
which b oth forthevery
h includes no
now wnegativ
commonlye and forrectified
used very p ositive
linear
argumen ts, universal
unit (Leshno et al., 1993). appro ximation theorems hav e also b een prov en for a wider
class of activation functions, which includes the now commonly used rectified linear
unitThe universal
(Leshno et al.approxim
approximation
, 1993). ation theorem means that regardless of what function
we are trying to learn, we kno know w that a large MLP will b e able to this
The universal
function. How ever,approxim
However, we are not ation theorem
guaran teed means
guaranteed that thethat regardless
training algorithmof what
willfunction
b e able
w
toe are tryingthat to learn, we
function. Evenknoifwthe
thatMLPa large
is ableMLP will b e able
to represent theto function, learning this
function.
can fail for twHow ever,
twoo differen we are not guaran teed that the training algorithm
differentt reasons. First, the optimization algorithm used for training will b e able
to that function. Even if the MLP is able to represent the function, learning
can fail for two different reasons. First, the 197 optimization algorithm used for training
CHAPTER 6. DEEP FEEDFORWARD NETWORKS

may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function due to overfitting. Recall from Sec. 5.2.1 that the "no free lunch" theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions, in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}^n is 2^{2^n}, and selecting one such function requires 2^n bits, which will in general require O(2^n) degrees of freedom.

In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.

There exist families of functions which can be approximated efficiently by an architecture with depth greater than some value d, but which require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n. Such results were first proven for models that do not resemble the continuous, differentiable neural networks used for machine learning, but have since been extended to these models. The first results were for circuits of logic gates (Håstad, 1986). Later work extended these results to linear threshold units with non-negative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency—they specify only that a sufficiently wide rectifier network could represent any function. Pascanu et al.
(2013b) and Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Fig. 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions which can capture all kinds of regular (e.g., repeating) patterns.

Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks formally shown by Pascanu et al. (2014a) and by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers).
More precisely, the main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer, is

    O\left( \binom{n}{d}^{d(l-1)} n^{d} \right),    (6.42)

i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the number of linear regions is

    O\left( k^{(l-1)+d} \right).    (6.43)
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation, but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See Fig. 6.6 and Fig. 6.7 for examples of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

6.4.2 Other Architectural Considerations

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.
Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in Chapter 9. Feedforward networks may also be generalized to the recurrent neural networks for sequence processing, described in Chapter 10, which have their own architectural considerations.
In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
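A minimal NumPy sketch of this wiring (our own construction, with arbitrary shapes; not from the original text) shows layer 1 feeding directly into layer 3:

    # A skip connection from layer 1 to layer 3 (hypothetical shapes).
    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2, W3 = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))

    def relu(a):
        return np.maximum(0, a)

    x = rng.standard_normal(8)
    h1 = relu(W1 @ x)
    h2 = relu(W2 @ h1)
    h3 = relu(W3 @ h2 + h1)   # skip connection: h1 reaches layer 3 directly,
                              # giving gradients a shorter path back to h1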
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See Fig. 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.

Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).

Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent. For example, convolutional networks, described in Chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give much more specific advice concerning the architecture of a generic neural network. Subsequent chapters develop the particular architectural strategies that have been found to work well for different application domains.
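As a toy illustration of such reduced connectivity (our own construction, unrelated to any particular architecture in the text), the snippet below masks a dense weight matrix so that each output unit sees only a small window of input units, shrinking the effective parameter count:

    # A toy sparse-connectivity pattern: each output unit is connected to a
    # window of 3 neighboring input units instead of all 8 (illustration only).
    import numpy as np

    n_in, n_out, window = 8, 8, 3
    mask = np.zeros((n_out, n_in))
    for i in range(n_out):
        lo = max(0, i - window // 2)
        mask[i, lo:lo + window] = 1.0   # keep only a local subset of edges

    rng = np.random.default_rng(0)
    W = rng.standard_normal((n_out, n_in)) * mask   # zero out absent edges
    x = rng.standard_normal(n_in)
    y = W @ x   # far fewer effective connections contribute to each output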
6.5 Back-Propagation and Other Differentiation Algorithms
When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The inputs x provide the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backwards through the network, in order to compute the gradient.
aluating such an expression can b e computationally exp ensive. The
expensive.
bac Computing
back-propagation an analytical
k-propagation algorithm do expression
does for the gradien
es so using a simple and inexp t is straigh
inexpensiv
ensiv tforward,
ensivee pro cedure.but
procedure.
numerically evaluating such an expression can b e computationally exp ensive. The
The term back-propagation
back-propagation algorithm do es isso often
using misundersto
misunderstoo
a simple ando dinexp as meaning
ensive prothe whole
cedure.
learning algorithm for multi-la multi-layer
yer neural netw networks.
orks. Actually
Actually,, bac back-propagation
k-propagation
The
refers term
only back-propagation
to the metho
method is often misundersto
d for computing the gradient,o dwhile as meaning the whole
another algorithm,
learning
suc
suchh as stoalgorithm
stochastic for multi-la
chastic gradient yer neural
descent, is used netw orks. Actually
to p erform learning ,using
back-propagation
this gradient.
refers only to the metho d for computing
Furthermore, back-propagation is often misundersto the gradient,
misunderstood while
od as b eing spanother
ecificalgorithm,
specific to multi-
suc
la h
layer as sto chastic
yer neural netw gradient
networks, descent, is used to p erform
orks, but in principle it can compute deriv learning
derivativ
ativusing
es of any gradient.
atives this function
F urthermore, back-propagation
(for some functions, the correct resp is often
response misundersto
onse is to rep report od as b eing
ort that the deriv sp ecific
derivativ
ativ to multi-
ativee of the
layer neural
function networks, but
is undefined). Sp in principle
Specifically
ecifically
ecifically, , we it can
will compute
describ
describe e ho
howderiv
w to ativ es of any
compute the function
gradien
gradientt
(for
∇x fsome
( x, y) functions, the correct
for an arbitrary functionresp onse isxto
f, where is arep
setort
of vthat the whose
ariables derivativ e of
deriv
derivativ
ativthe
ativeses
are desired, and y is an additional set of variables that are inputs to the functiont
function is undefined). Sp ecifically , we will describ e ho w to compute the gradien
f( x, y) for an arbitrary function f, where x is a set of variables whose derivatives
are desired, and y is an additional set of203variables that are inputs to the function

but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well, and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information through a network is very general, and can be used to compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case where f has a single output.

6.5.1 Computational Graphs

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

Many ways of formalizing computation as graphs are possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in Fig. 6.8.

Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x^⊤ w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u^(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λ Σ_i w_i^2.
6.5.2 Chain Rule of Calculus

The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that

    \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}.    (6.44)

We can generalize this beyond the scalar case. Suppose that x ∈ R^m, y ∈ R^n, g maps from R^m to R^n, and f maps from R^n to R. If y = g(x) and z = f(y), then

    \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}.    (6.45)

In vector notation, this may be equivalently written as

    \nabla_{x} z = \left( \frac{\partial y}{\partial x} \right)^{\top} \nabla_{y} z,    (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g.
From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.
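The following NumPy sketch (our own toy example, not from the text: g is a linear map so its Jacobian is simply W, and f sums squares) checks the Jacobian-gradient product of Eq. 6.46 against finite differences:

    # Verifying Eq. 6.46 numerically: y = g(x) = W x, z = f(y) = sum(y**2).
    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.standard_normal((3, 2))   # g: R^2 -> R^3, Jacobian dy/dx is W
    x = rng.standard_normal(2)

    y = W @ x
    grad_y = 2 * y                    # gradient of z = sum(y**2) w.r.t. y
    grad_x = W.T @ grad_y             # Eq. 6.46: Jacobian-transpose times gradient

    eps = 1e-6
    for i in range(x.size):           # finite-difference check of grad_x
        dx = np.zeros_like(x)
        dx[i] = eps
        z_plus = np.sum((W @ (x + dx)) ** 2)
        z_minus = np.sum((W @ (x - dx)) ** 2)
        assert abs((z_plus - z_minus) / (2 * eps) - grad_x[i]) < 1e-4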
Usually we do not apply the back-propagation algorithm merely to vectors, but rather to tensors of arbitrary dimensionality. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.
To denote the gradient of a value z with respect to a tensor X, we write ∇_X z, just as if X were a vector. The indices into X now have multiple coordinates—for example, a 3-D tensor is indexed by three coordinates. We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how for all possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then

    \nabla_{X} z = \sum_j (\nabla_{X} Y_j) \frac{\partial z}{\partial Y_j}.    (6.47)
6.5.3 Recursively Applying the Chain Rule to Obtain Backprop

Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. However, actually evaluating that expression in a computer introduces some extra considerations.

Specifically, many subexpressions may be repeated several times within the overall expression for the gradient. Any procedure that computes the gradient will need to choose whether to store these subexpressions or to recompute them several times. An example of how these repeated subexpressions arise is given in Fig. 6.9. In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible. In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.
We first begin with a version of the back-propagation algorithm that specifies the actual gradient computation directly (Algorithm 6.2 along with Algorithm 6.1 for the associated forward computation), in the order it will actually be done and according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and the construction of the symbolic graph that performs the gradient computation. Such a formulation is presented below in Sec. 6.5.6, with Algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar u^(n) (say the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, ..., n_i}. In the application of back-propagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(n_i) correspond to the parameters of the model.
We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at u^(n_i+1) and going up to u^(n). As defined in Algorithm 6.1, each node u^(i) is associated with an operation f^(i) and is computed by evaluating the function

    u^{(i)} = f^{(i)}(A^{(i)})    (6.48)

where A^(i) is the set of all nodes that are parents of u^(i).

Algorithm 6.1: A procedure that performs the computations mapping n_i inputs u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each node computes a numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is read off the last (output) node u^(n).

for i = 1, ..., n_i do
    u^(i) ← x_i
end for
for i = n_i + 1, ..., n do
    A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
    u^(i) ← f^(i)(A^(i))
end for
return u^(n)
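A small Python rendering of Algorithm 6.1 (our own encoding, not from the text: nodes are indexed in topological order, with a dictionary of operations f and parent lists Pa) might look as follows:

    # Forward propagation over a topologically ordered computational graph.
    def forward(x, n, f, Pa):
        u = list(x)                        # u[0 .. n_i - 1] hold the inputs
        for i in range(len(x), n):
            args = [u[j] for j in Pa[i]]   # A^(i): values of the parents of u^(i)
            u.append(f[i](args))           # u^(i) = f^(i)(A^(i))
        return u                           # u[n - 1] is the output u^(n)

    # Example graph with four nodes: u2 = u0 + u1, u3 = u2 * u1.
    f = {2: lambda a: a[0] + a[1], 3: lambda a: a[0] * a[1]}
    Pa = {2: [0, 1], 3: [2, 1]}
    values = forward([1.5, 2.0], 4, f, Pa)   # values[3] == (1.5 + 2.0) * 2.0 == 7.0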
That algorithm specifies the forward propagation computation, which we could put in a graph G. In order to perform back-propagation, we can construct a computational graph that depends on G and adds to it an extra set of nodes. These form a subgraph B with one node per node of G. Computation in B proceeds in exactly the reverse of the order of computation in G, and each node of B computes the derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done using the chain rule with respect to the scalar output u^(n):

    \frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{i : j \in Pa(u^{(i)})} \frac{\partial u^{(n)}}{\partial u^{(i)}} \frac{\partial u^{(i)}}{\partial u^{(j)}}    (6.49)

as specified by Algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children
of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i). To summarize, the amount of computation required for performing the back-propagation scales linearly with the number of edges in G, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.

Algorithm 6.2: Simplified version of the back-propagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), ..., u^(n_i). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the back-propagation graph.
Run forward propagation (Algorithm 6.1 for this example) to obtain the activations of the network
Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
    The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} ∂u^(n)/∂u^(i) ∂u^(i)/∂u^(j) using stored values:
    grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] ∂u^(i)/∂u^(j)
end for
return {grad_table[u^(i)] | i = 1, ..., n_i}
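A matching Python sketch of Algorithm 6.2 (again our own encoding, reusing the example graph from the Algorithm 6.1 sketch above; df[i][k] supplies the partial derivative of u^(i) with respect to its k-th parent) could look like this:

    # Example graph from the Algorithm 6.1 sketch: u2 = u0 + u1, u3 = u2 * u1,
    # with forward values already computed.
    u = [1.5, 2.0, 3.5, 7.0]
    Pa = {2: [0, 1], 3: [2, 1]}
    # df[i][k]: partial derivative of u^(i) with respect to its k-th parent.
    df = {2: [lambda a: 1.0, lambda a: 1.0],
          3: [lambda a: a[1], lambda a: a[0]]}

    def backprop(u, n, Pa, df, n_inputs):
        grad_table = {n - 1: 1.0}              # d u^(n) / d u^(n) = 1
        for j in range(n - 2, -1, -1):         # reverse of the forward order
            total = 0.0
            for i in range(j + 1, n):          # children i with j in Pa(u^(i))
                for k, p in enumerate(Pa.get(i, [])):
                    if p == j:                 # edge u^(j) -> u^(i)
                        args = [u[q] for q in Pa[i]]
                        total += grad_table[i] * df[i][k](args)   # Eq. 6.49
            grad_table[j] = total
        return [grad_table[j] for j in range(n_inputs)]

    grads = backprop(u, 4, Pa, df, n_inputs=2)   # [2.0, 5.5]
    # Check: u3 = (u0 + u1) * u1, so d u3/d u0 = u1 = 2.0
    # and d u3/d u1 = u0 + 2 * u1 = 5.5.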
The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that in Algorithm 6.2 backprop visits each edge from node u^(j) to node u^(i) of the graph exactly once in order to obtain the associated partial derivative ∂u^(i)/∂u^(j). Back-propagation thus avoids the exponential explosion in repeated subexpressions.
However, other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We will revisit these ideas after describing the back-propagation algorithm itself.

6.5.4 Back-Propagation Computation in Fully-Connected MLP

To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully-connected multi-layer MLP.

Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided in input.

Algorithm 6.4 then shows the corresponding computation to be done for applying the back-propagation algorithm to this graph.

Algorithm 6.3 and Algorithm 6.4 are demonstrations that are chosen to be simple and straightforward to understand. However, they are specialized to one specific problem.

Modern software implementations are based on the generalized form of back-propagation described in Sec. 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.

6.5.5 Symbol-to-Symbol Derivatives

Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network x with a specific numeric value, such as [1.2, 3.765, −1.8]^⊤.

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach "symbol-to-number" differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This
Figure 6.9: A computational graph that results in rep repeated


eated subsubexpressions
expressions when computing
the gradient. Let w ∈ R b e the input to the graph. We use the same function f : R → R
Figure
as the 6.9: A computational
operation that R graph
we apply atthat
every results
step in of rep eated sub
a chain: x =expressions
f( w), y =when
f ( x),computing
R f(yR).
z =
Tthe gradient. Let
o compute , wew applyb eEq.
the6.44
input and the graph. We use the same function f :
to obtain:
as the operation that ∈ we apply at every step of a chain: x = f( w), y = f ( x), z = → f(y ).
To compute , we apply Eq. 6.44 ∂ z and obtain:
(6.50)
∂w
∂∂zz ∂ y ∂ x (6.50)
= ∂w (6.51)
∂y ∂x ∂w
∂z ∂y ∂x
=f 0(y )f 0(x)f 0 (w ) (6.52)
(6.51)
∂y ∂x ∂w
0 0 0
=f (f (f (w )))f (f (w ))f (w ) (6.53)
=f (y )f (x)f (w ) (6.52)
Eq. 6.52 suggests an implemen =
implementationf ( f ( f ( w ))) f (f ( w ))f (w )
tation in which we compute the value of f (w ) only(6.53) once
and store it in the variable x . This is the approach taken by the back-propagation
Eq. 6.52 suggests
algorithm. an implemen
An alternative tationisinsuggested
approach which webcompute y Eq. 6.53 the value the
, where (w ) expression
of f sub only once
subexpression
fand
(w ) store
app it in
appears
ears thethan
more variable x
once. In . This is the approach
the alternativ
alternative e approac
approach, taken
h, f (w)byis the back-propagation
recomputed eac
each
h time
algorithm.
it is needed.AnWhenalternative approach
the memory is suggested
required to storebthe y Eq. 6.53of, where
value the sub expression
these expressions is low,
f (w ) app ears more than once. In the alternativ e approac h, f
the back-propagation approach of Eq. 6.52 is clearly preferable b ecause of its reduced( w ) is recomputed each time
it
run is needed.
runtime.
time. How When
However, the memory required to store the value of these
ever, Eq. 6.53 is also a valid implementation of the chain rule, and is useful expressions is low,
the back-propagation
when memory is limited. approach of Eq. 6.52 is clearly preferable b ecause of its reduced
runtime. However, Eq. 6.53 is also a valid implementation of the chain rule, and is useful
when memory is limited.
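As a concrete illustration of this trade-off, here is a minimal Python sketch of the two strategies, taking f = sin as an arbitrary stand-in for the chain operation (the function choice is an assumption, not from the text):

```python
import math

def f(x):       # the chain operation; sin is an arbitrary illustrative choice
    return math.sin(x)

def df(x):      # its derivative
    return math.cos(x)

def grad_stored(w):
    # Eq. 6.52: the forward pass stores x = f(w) and y = f(x), so each
    # intermediate value is computed exactly once (back-propagation's choice).
    x = f(w)
    y = f(x)
    return df(y) * df(x) * df(w)

def grad_recomputed(w):
    # Eq. 6.53: the subexpression f(w) is recomputed wherever it appears,
    # trading extra runtime for not storing intermediate values.
    return df(f(f(w))) * df(f(w)) * df(w)

w = 0.5
assert abs(grad_stored(w) - grad_recomputed(w)) < 1e-12  # same chain rule
```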
Algorithm 6.3 Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see Sec. 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See Sec. 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, . . . , l}, the weight matrices of the model
Require: b^(i), i ∈ {1, . . . , l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
  h^(0) = x
  for k = 1, . . . , l do
    a^(k) = b^(k) + W^(k) h^(k−1)
    h^(k) = f(a^(k))
  end for
  ŷ = h^(l)
  J = L(ŷ, y) + λΩ(θ)
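A NumPy sketch of this forward pass for a single example follows; the choices f = relu, a squared-error loss, and an L2 regularizer are assumptions made only for concreteness:

```python
import numpy as np

def forward(x, y, W, b, lam):
    """Forward pass in the style of Algorithm 6.3. W and b are lists holding
    one weight matrix and one bias vector per layer; lam is the weight lambda."""
    hiddens = [x]                        # h^(0) = x
    activations = []                     # each a^(k), kept for the backward pass
    h = x
    for W_k, b_k in zip(W, b):
        a = b_k + W_k @ h                # a^(k) = b^(k) + W^(k) h^(k-1)
        h = np.maximum(0.0, a)           # h^(k) = f(a^(k)), with f = relu
        activations.append(a)
        hiddens.append(h)
    y_hat = h                            # y_hat = h^(l)
    J = 0.5 * np.sum((y_hat - y) ** 2) \
        + lam * sum(np.sum(W_k ** 2) for W_k in W)   # J = L + lambda * Omega
    return J, y_hat, activations, hiddens
```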
is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how this approach works is illustrated in Fig. 6.10. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives in order to obtain higher derivatives. Computation of higher-order derivatives is described in Sec. 6.5.10.

We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents’ values are available.
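To make the symbol-to-symbol idea concrete, here is a small illustrative Python sketch (the node classes and the restriction to sin applications are assumptions, not a real library's API) in which differentiation returns new graph nodes rather than numbers:

```python
import math

class Var:                        # a leaf symbol, e.g. w
    def __init__(self, name): self.name = name
    def eval(self, env): return env[self.name]
    def grad(self, wrt):          # dw/dw = 1, dv/dw = 0, emitted as constants
        return Const(1.0) if self is wrt else Const(0.0)

class Const:
    def __init__(self, value): self.value = value
    def eval(self, env): return self.value

class Cos:                        # used only inside derivative graphs
    def __init__(self, child): self.child = child
    def eval(self, env): return math.cos(self.child.eval(env))

class Sin:                        # node applying f = sin to one child
    def __init__(self, child): self.child = child
    def eval(self, env): return math.sin(self.child.eval(env))
    def grad(self, wrt):          # chain rule, expressed as more graph nodes
        return Mul(Cos(self.child), self.child.grad(wrt))

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) * self.b.eval(env)

w = Var("w")
z = Sin(Sin(Sin(w)))              # z = f(f(f(w)))
dz_dw = z.grad(w)                 # a graph, not a number
print(dz_dw.eval({"w": 0.5}))     # numeric values supplied only at the end
```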
The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number
Algorithm 6.4 Backward computation for the deep neural network of Algorithm 6.3, which uses in addition to the input x a target y. This computation yields the gradients on the activations a^(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer’s output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

  After the forward computation, compute the gradient on the output layer:
  g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
  for k = l, l − 1, . . . , 1 do
    Convert the gradient on the layer’s output into a gradient into the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
    g ← ∇_{a^(k)} J = g ⊙ f′(a^(k))
    Compute gradients on weights and biases (including the regularization term, where needed):
    ∇_{b^(k)} J = g + λ∇_{b^(k)} Ω(θ)
    ∇_{W^(k)} J = g h^{(k−1)⊤} + λ∇_{W^(k)} Ω(θ)
    Propagate the gradients w.r.t. the next lower-level hidden layer’s activations:
    g ← ∇_{h^(k−1)} J = W^{(k)⊤} g
  end for
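Continuing the forward sketch above, a NumPy version of this backward pass under the same assumptions (f = relu, squared-error loss, L2 penalty on weights only):

```python
import numpy as np

def backward(x, y, W, b, lam, activations, hiddens, y_hat):
    """Backward pass in the style of Algorithm 6.4, consuming the values
    stored by the earlier forward sketch. Returns per-layer gradients."""
    grads_W = [None] * len(W)
    grads_b = [None] * len(W)
    g = y_hat - y                        # gradient of the squared-error loss
    for k in reversed(range(len(W))):
        g = g * (activations[k] > 0)     # g <- g * f'(a^(k)), relu derivative
        grads_b[k] = g.copy()            # the L2 penalty has no bias term here
        grads_W[k] = np.outer(g, hiddens[k]) + 2 * lam * W[k]
        g = W[k].T @ g                   # propagate to the layer below
    return grads_W, grads_b
```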
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.
approach does not expose the graph.
6.5.6 General Back-Propagation

The back-propagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians traveling backwards through the graph in this way until we reach x. For any node that may be reached by going backwards from z through two or more paths, we simply sum the gradients arriving from different paths at that node.

More formally, each node in the graph 𝒢 corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor. Tensors can in general have any number of dimensions, and subsume scalars, vectors, and matrices.
We assume that each variable V is associated with the following subroutines:

• get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, and the matmul function. Suppose we have a variable that is created by matrix multiplication, C = AB. Then get_operation(C) returns a pointer to an instance of the corresponding C++ class.

• get_consumers(V, 𝒢): This returns the list of variables that are children of V in the computational graph 𝒢.

• get_inputs(V, 𝒢): This returns the list of variables that are parents of V in the computational graph 𝒢.

Each operation is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by Eq. 6.47. This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments. If we call the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A⊤G. The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return

$$\sum_i \left(\nabla_X\, \texttt{op.f(inputs)}_i\right) G_i, \tag{6.54}$$

which is just an implementation of the chain rule as expressed in Eq. 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.
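As an illustration of these conventions, a minimal Python sketch of a matrix multiplication operation with its bprop method follows; the class layout is an assumption made for illustration, while the two gradient rules themselves come from the text:

```python
import numpy as np

class MatMul:
    """Toy operation computing C = A B, with the two bprop rules
    described in the text (illustrative design, not a real library)."""
    def f(self, inputs):
        A, B = inputs
        return A @ B

    def bprop(self, inputs, X, G):
        # G is the gradient on the output C; X is the input whose gradient
        # is requested. Each rule is one Jacobian-vector product.
        A, B = inputs
        if X is A:
            return G @ B.T        # gradient with respect to A is G B^T
        if X is B:
            return A.T @ G        # gradient with respect to B is A^T G
        raise ValueError("X must be one of the operation's inputs")

# Quick shape check: for z = sum(C), the gradient G on C is all ones.
A, B = np.ones((2, 3)), np.ones((3, 4))
op, G = MatMul(), np.ones((2, 4))
assert op.bprop([A, B], A, G).shape == A.shape
assert op.bprop([A, B], B, G).shape == B.shape
```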
The bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of x to compute x^2, the bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.
Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on. Software engineers who build a new implementation of back-propagation or advanced users who need to add their own operation to an existing library must usually derive the bprop method for any new operations manually.

The back-propagation algorithm is formally described in Algorithm 6.5.
In Sec. 6.5.2, we motivated back-propagation as a strategy for avoiding computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of very many arithmetic operations (for example, we might have a graph that treats matrix
Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of Algorithm 6.6.

Require: T, the target set of variables whose gradients must be computed
Require: 𝒢, the computational graph
Require: z, the variable to be differentiated
  Let 𝒢′ be 𝒢 pruned to contain only nodes that are ancestors of z and descendents of nodes in T
  Initialize grad_table, a data structure associating tensors to their gradients
  grad_table[z] ← 1
  for V in T do
    build_grad(V, 𝒢, 𝒢′, grad_table)
  end for
  Return grad_table restricted to T
multiplication as a single operation). Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or store the output of more than O(n²) operations. Here we are counting operations in the computational graph, not individual operations executed by the underlying hardware, so it is important to remember that the runtime of each operation may be highly variable. For example, multiplying two matrices that each contain millions of entries might correspond to a single operation in the graph. We can see that computing the gradient requires at most O(n²) operations because the forward propagation stage will at worst execute all n nodes in the original graph (depending on which values we want to compute, we may not need to execute the entire graph). The back-propagation algorithm adds one Jacobian-vector product, which should be expressed with O(1) nodes, per edge in the original graph. Because the computational graph is a directed acyclic graph it has at most O(n²) edges. For the kinds of graphs that are commonly used in practice, the situation is even better. Most neural network cost functions are roughly chain-structured, causing back-propagation to have O(n) cost. This is far better than the naive approach, which might need to execute exponentially many nodes. This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule (Eq. 6.49) non-recursively:

$$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{\substack{\text{path }(u^{(\pi_1)},\,u^{(\pi_2)},\,\ldots,\,u^{(\pi_t)}),\\ \text{from } \pi_1=j \text{ to } \pi_t=n}} \;\prod_{k=2}^{t} \frac{\partial u^{(\pi_k)}}{\partial u^{(\pi_{k-1})}}. \tag{6.55}$$

Since the number of paths from node j to node n can grow up to exponentially in the
Algorithm 6.6 The inner loop subroutine build_grad(V, 𝒢, 𝒢′, grad_table) of the back-propagation algorithm, called by the back-propagation algorithm defined in Algorithm 6.5.

Require: V, the variable whose gradient should be added to 𝒢 and grad_table
Require: 𝒢, the graph to modify
Require: 𝒢′, the restriction of 𝒢 to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
  if V is in grad_table then
    Return grad_table[V]
  end if
  i ← 1
  for C in get_consumers(V, 𝒢′) do
    op ← get_operation(C)
    D ← build_grad(C, 𝒢, 𝒢′, grad_table)
    G^(i) ← op.bprop(get_inputs(C, 𝒢′), V, D)
    i ← i + 1
  end for
  G ← Σ_i G^(i)
  grad_table[V] = G
  Insert G and the operations creating it into 𝒢
  Return G
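A compact Python rendering of Algorithms 6.5 and 6.6 follows. The Node structure and function names are hypothetical; for brevity the sketch propagates numeric gradients rather than inserting new graph nodes (i.e., it is symbol-to-number), omits the pruning to 𝒢′, and assumes each op's bprop implements Eq. 6.54 for its own inputs:

```python
class Node:
    """Toy graph node: op is an object with a bprop method (as sketched
    earlier) and inputs are the parent Nodes; leaves have op=None."""
    def __init__(self, op=None, inputs=()):
        self.op, self.inputs = op, inputs

def get_consumers(V, graph):
    return [C for C in graph if V in C.inputs]

def build_grad(V, graph, grad_table):
    # Algorithm 6.6: fill grad_table[V], reusing it if already present,
    # which is exactly the table-filling behavior described below.
    if V in grad_table:
        return grad_table[V]
    total = 0
    for C in get_consumers(V, graph):
        D = build_grad(C, graph, grad_table)            # gradient on consumer
        total = total + C.op.bprop(C.inputs, V, D)      # sum over paths
    grad_table[V] = total
    return total

def backprop(targets, graph, z):
    # Algorithm 6.5: the outermost skeleton.
    grad_table = {z: 1.0}                               # dz/dz = 1
    return {V: build_grad(V, graph, grad_table) for V in targets}
```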
length of these paths, the number of terms in the above sum, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for ∂u^(i)/∂u^(j) would be redone many times. To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results ∂u^(n)/∂u^(i). Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.
6.5.7 Example: Back-Propagation for MLP Training

As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron.

Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent.
The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW^(1)}. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW^(2). We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost J_MLE. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost

$$J = J_{\text{MLE}} + \lambda\left(\sum_{i,j}\left(W^{(1)}_{i,j}\right)^2 + \sum_{i,j}\left(W^{(2)}_{i,j}\right)^2\right) \tag{6.56}$$

consists of the cross-entropy and a weight decay term with coefficient λ. The computational graph is illustrated in Fig. 6.11.

The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.

We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in Fig. 6.11. To train, we wish to compute both ∇_{W^(1)} J and ∇_{W^(2)} J. There are two different paths leading backward from J to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute 2λW^(i) to the gradient on W^(i).

The other path through the cross-entropy cost is slightly more complicated. Let G be the gradient on the unnormalized log probabilities U^(2) provided by the cross_entropy operation. The back-propagation algorithm now needs to explore two different branches. On the shorter branch, it adds H⊤G to the gradient on W^(2), using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes ∇_H J = GW^(2)⊤ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-
Figure 6.11: The computational graph used to compute the cost used to train our example of a single-layer MLP using the cross-entropy loss and weight decay.
propagation rule to zero out components of the gradient corresponding to entries of U^(1) that were less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X⊤G′ to the gradient on W^(1).

After these gradients have been computed, it is the responsibility of the gradient descent algorithm, or another optimization algorithm, to use these gradients to update the parameters.

For the MLP, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in O(w) multiply-adds, where w is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost. The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer. This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus O(mn_h), where m is the number of examples in the minibatch and n_h is the number of hidden units.
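The walk-through above can be condensed into a short NumPy sketch. The minibatch size, layer widths, random data, and the use of the simplified softmax cross-entropy gradient (see Sec. 6.5.9) are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_h, n_out = 8, 4, 5, 3           # arbitrary illustrative sizes
X = rng.normal(size=(m, n_in))
y = rng.integers(n_out, size=m)            # class labels
W1 = rng.normal(size=(n_in, n_h))
W2 = rng.normal(size=(n_h, n_out))
lam = 0.01

U1 = X @ W1
H = np.maximum(0, U1)                      # H = relu(X W1)
U2 = H @ W2                                # unnormalized log probabilities
P = np.exp(U2) / np.exp(U2).sum(1, keepdims=True)   # softmax

# G: gradient of the mean cross-entropy with respect to U2 (q - p form)
G = P.copy()
G[np.arange(m), y] -= 1.0
G /= m

grad_W2 = H.T @ G + 2 * lam * W2           # shorter branch plus weight decay
G1 = (G @ W2.T) * (U1 > 0)                 # longer branch: relu zeroes entries
grad_W1 = X.T @ G1 + 2 * lam * W1
```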
6.5.8 Complications

Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice.

As noted above, we have restricted the definition of an operation to be a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor. For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-propagation. Back-propagation often involves summation of many tensors together. In the naive approach, each of these tensors would be computed separately, then all of them would be added in a second step. The naive approach has an overly high memory bottleneck that can be avoided by maintaining a single buffer and adding each value to that buffer as it is computed.

Real-world implementations of back-propagation also need to handle various data types, such as 32-bit floating point, 64-bit floating point, and integer values. The policy for handling each of these types takes special care to design.

Some operations have undefined gradients, and it is important to track these cases and determine whether the gradient requested by the user is undefined.

Various other technicalities make real-world differentiation more complicated. These technicalities are not insurmountable, and this chapter has described the key intellectual tools needed to compute derivatives, but it is important to be aware that many more subtleties exist.
6.5.9 Differentiation outside the Deep Learning Community

The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation. More generally, the field of automatic differentiation is concerned with how to compute derivatives algorithmically. The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders. In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-complete (Naumann, 2008), in the
sense that it may require simplifying algebraic expressions into their least expensive form.

For example, suppose we have variables p_1, p_2, . . . , p_n representing probabilities and variables z_1, z_2, . . . , z_n representing unnormalized log probabilities. Suppose we define

$$q_i = \frac{\exp(z_i)}{\sum_i \exp(z_i)}, \tag{6.57}$$

where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss J = −Σ_i p_i log q_i. A human mathematician can observe that the derivative of J with respect to z_i takes a very simple form: q_i − p_i. The back-propagation algorithm is not capable of simplifying the gradient this way, and will instead explicitly propagate gradients through all of the logarithm and exponentiation operations in the original graph. Some software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.
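A quick numeric sanity check of the simplified form q_i − p_i (illustrative only; a finite-difference gradient plays the role of the explicitly propagated one):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)
p = np.full(5, 0.2)                       # a valid probability vector

def J(z):
    q = np.exp(z) / np.sum(np.exp(z))     # softmax from primitive operations
    return -np.sum(p * np.log(q))         # cross-entropy loss

q = np.exp(z) / np.sum(np.exp(z))
simplified = q - p                        # the human-derived form
eps = 1e-6
numeric = np.array([(J(z + eps * e) - J(z - eps * e)) / (2 * eps)
                    for e in np.eye(5)])  # central differences
assert np.allclose(simplified, numeric, atol=1e-5)
```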
When the forward graph 𝒢 has a single output node and each partial derivative ∂u^(i)/∂u^(j) can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in Algorithm 6.2 because each local partial derivative ∂u^(i)/∂u^(j) needs to be computed only once along with an associated multiplication and addition for the recursive chain-rule formulation (Eq. 6.49). The overall computation is therefore O(# edges). However, it can potentially be reduced by simplifying the computational graph constructed by back-propagation, and this is an NP-complete task. Implementations such as Theano and TensorFlow use heuristics based on matching known simplification patterns in order to iteratively attempt to simplify the graph. We defined back-propagation only for the computation of a gradient of a scalar output but back-propagation can be extended to compute a Jacobian (either of k different scalar nodes in the graph, or of a tensor-valued node containing k values). A naive implementation may then need k times more computation: for each scalar internal node in the original forward graph, the naive implementation computes k gradients instead of a single gradient. When the number of outputs of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode computation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989). This also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory. The relationship between forward mode
and backward mode is analogous to the relationship between left-multiplying versus right-multiplying a sequence of matrices, such as

$$ABCD, \tag{6.58}$$

where the matrices can be thought of as Jacobian matrices. For example, if D is a column vector while A has many rows, this corresponds to a graph with a single output and many inputs, and starting the multiplications from the end and going backwards only requires matrix-vector products. This corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. However, if A has fewer rows than D has columns, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.

In many communities outside of machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the bprop methods for every operation and limiting the user of the library to only those operations that have been defined. However, the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, allowing the developer to improve speed or stability in non-obvious ways that an automatic procedure would presumably be unable to replicate.

Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a very practical method that continues to serve the deep learning community very well. In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.
6.5.10 Higher-Order Derivatives

Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian
matrix. If we ha havve a function f : Rn → R, then the Hessian matrix is of size n × n.


In typical deep learning applications, R nRwill b e the number of parameters in the
matrix.
mo
model, If we ha v e a function
del, which could easily num f
numb:b er in the , then the Hessian
billions. The en matrix
entire
tire is of size
Hessian n n
matrix is.
In
th
thustypical
us deeptolearning
infeasible applications,
even represen
represent. n
t. → will b e the number of parameters in × the
mo del, which could easily numb er in the billions. The entire Hessian matrix is
Instead of explicitly computing the Hessian, the typical deep learning approac approach h
thus infeasible to even represent.
is to use Krylov methomethods ds
ds.. Krylo
Krylov v metho
methods ds are a set of iterative techniques for
Instead of
p erforming explicitly
various op computing
operations
erations likeethe
lik Hessian, the tin
approximately ypical
inv deep
verting learningorapproac
a matrix findingh
is to use
appro Krylov to
approximations
ximations metho ds. Krylo
its eigenv
eigenvectors v metho
ectors or eigends vare
eigenv alues,a set of iterative
without using techniques
any op for
operation
eration
p erforming v arious op
other than matrix-vector pro erations lik
products.
ducts. e approximately in v erting a matrix or finding
approximations to its eigenvectors or eigenvalues, without using any op eration
otherIn than
ordermatrix-vector
to use Krylovpro metho
methods
ducts.ds on the Hessian, we only need to b e able to
compute the pro product
duct b etw een the Hessian matrix H and an arbitrary vector v. A
etween
In
straigh order
straightforward to use
tforward tec Krylov
technique metho ds on ,the
hnique (Christianson 1992Hessian,
) for doing we only
so is need to b e able to
to compute
compute the pro duct b etween the Hessian h matrix H i and an arbitrary vector v. A
>
straightforward technique (H v = ∇x (∇
Christianson x f (x)))forvdoing
, 1992 . so is to compute (6.59)
Both of the gradien
gradientt computations
H v = in this( expression
f (x)) v .mamay
y b e computed automati-
(6.59)
cally by the appropriate soft software
ware library
library.
∇ ∇. Note that the outer gradien
gradientt expression
Both
tak
takes of the gradien t computations in this expression
es the gradient of a function of the inner gradien may b e computed
gradientt expression. automati-
cally by the appropriate software library. Note that the outer gradient expression
takesIf vthe
is gradient
itself a vector pro
produced
duced
of a function byhainner
of the computational graph, it is imp
gradienit expression. important
ortant to
sp
specify
ecify that the automatic differen
differentiation
tiation softw
software
are should not differentiate through
If v is itself
the graph that pro a vector
produced pro
duced v . duced by a computational graph, it is imp ortant to
sp ecify that the automatic differentiation software should not differentiate through
While computing the Hessian is usually not advisable, it is p ossible to do with
the graph that pro duced v .
Hessian vector pro ducts. One simply computes H e(i) for all i = 1, . . . , n, where
products.
While computing the Hessian(i)is usually not advisable, it is p ossible to do with
e(i) is the one-hot vector with e i = 1 and all other entries equal to 0.
Hessian vector pro ducts. One simply computes H e for all i = 1, . . . , n, where
e is the one-hot vector with e = 1 and all other entries equal to 0.
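To make Eq. 6.59 concrete, the following is a minimal sketch of a Hessian-vector
product built from two nested gradient computations. It assumes the JAX library
as the automatic differentiation software, and the toy objective f and helper hvp
are purely illustrative names, not anything defined in this book.

    import jax
    import jax.numpy as jnp

    def f(x):
        # Toy scalar objective standing in for a model's cost function.
        return jnp.sum(x ** 3) + jnp.dot(x, x)

    def hvp(f, x, v):
        # Eq. 6.59: differentiate the scalar (grad f(x))^T v with respect to x.
        # stop_gradient ensures we do not differentiate through the graph
        # that produced v, as cautioned in the text above.
        v = jax.lax.stop_gradient(v)
        return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)

    x = jnp.array([1.0, 2.0, 3.0])
    e1 = jnp.array([1.0, 0.0, 0.0])   # one-hot e^(1) recovers the first column of H
    print(hvp(f, x, e1))              # here H = diag(6x) + 2I, so H e^(1) = [8, 0, 0]

Calling such a routine once per one-hot vector e⁽ⁱ⁾ reconstructs the full Hessian
column by column, while a Krylov method would instead consume these products
one at a time without ever forming the matrix.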
6.6 Historical Notes

Feedforward networks can be seen as efficient nonlinear function approximators
based on using gradient descent to minimize the error in a function approximation.
From this point of view, the modern feedforward network is the culmination of
centuries of progress on the general function approximation task.

The chain rule that underlies the back-propagation algorithm was invented
in the 17th century (Leibniz, 1676; L'Hôpital, 1696). Calculus and algebra have
long been used to solve optimization problems in closed form, but gradient descent
was not introduced as a technique for iteratively approximating the solution to
optimization problems until the 19th century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to
motivate machine learning models such as the perceptron. However, the earliest
models were based on linear models. Critics including Marvin Minsky pointed
out several of the flaws of the linear model family, such as its inability to learn the
XOR function, which led to a backlash against the entire neural network approach.

Learning nonlinear functions required the development of a multilayer
perceptron and a means of computing the gradient through such a model. Efficient
applications of the chain rule based on dynamic programming began to appear in
the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and
Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for
sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these
techniques to training artificial neural networks. The idea was finally developed
in practice after being independently rediscovered in different ways (LeCun, 1985;
Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing
presented the results of some of the first successful experiments with
back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly
to the popularization of back-propagation and initiated a very active period of
research in multi-layer neural networks. However, the ideas put forward by the
authors of that book, and in particular by Rumelhart and Hinton, go much beyond
back-propagation. They include crucial ideas about the possible computational
implementation of several central aspects of cognition and learning, which came
under the name of “connectionism” because of the importance given the connections
between neurons as the locus of learning and memory. In particular, these ideas
include the notion of distributed representation (Hinton et al., 1986).

Following the success of back-propagation, neural network research gained
popularity and reached a peak in the early 1990s. Afterwards, other machine
learning techniques became more popular until the modern deep learning
renaissance that began in 2006.

The core ideas behind modern feedforward networks have not changed
substantially since the 1980s. The same back-propagation algorithm and the same
approaches to gradient descent are still in use. Most of the improvement in neural
network performance from 1986 to 2015 can be attributed to two factors. First,
larger datasets have reduced the degree to which statistical generalization is a
challenge for neural networks. Second, neural networks have become much larger,
due to more powerful computers and better software infrastructure. However, a
small number of algorithmic changes have improved the performance of neural
networks noticeably.
One of these algorithmic changes was the replacement of mean squared error
with the cross-entropy family of loss functions. Mean squared error was popular in
the 1980s and 1990s, but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community
and the machine learning community. The use of cross-entropy losses greatly
improved the performance of models with sigmoid and softmax outputs, which
had previously suffered from saturation and slow learning when using the mean
squared error loss.
The other major algorithmic change that has greatly improved the performance
of feedforward networks was the replacement of sigmoid hidden units with piecewise
linear hidden units, such as rectified linear units. Rectification using the max{0, z}
function was introduced in early neural network models and dates back at least
as far as the Cognitron and Neocognitron (Fukushima, 1975, 1980). These early
models did not use rectified linear units, but instead applied rectification to
nonlinear functions. Despite the early popularity of rectification, rectification was
largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better
when neural networks are very small. As of the early 2000s, rectified linear units
were avoided due to a somewhat superstitious belief that activation functions with
non-differentiable points must be avoided. This began to change in about 2009.
Jarrett et al. (2009) observed that “using a rectifying nonlinearity is the single most
important factor in improving the performance of a recognition system” among
several different factors of neural network architecture design.

For small datasets, Jarrett et al. (2009) observed that using rectifying
nonlinearities is even more important than learning the weights of the hidden layers.
Random weights are sufficient to propagate useful information through a rectified
linear network, allowing the classifier layer at the top to learn how to map different
feature vectors to class identities.

When more data is available, learning begins to extract enough useful knowledge
to exceed the performance of randomly chosen parameters. Glorot et al. (2011a)
showed that learning is far easier in deep rectified linear networks than in deep
networks that have curvature or two-sided saturation in their activation functions.

Rectified linear units are also of historical interest because they show that
neuroscience has continued to have an influence on the development of deep
learning algorithms. Glorot et al. (2011a) motivate rectified linear units from
biological considerations. The half-rectifying nonlinearity was intended to capture
these properties of biological neurons: 1) For some inputs, biological neurons are
completely inactive. 2) For some inputs, a biological neuron's output is proportional
to its input. 3) Most of the time, biological neurons operate in the regime where
they are inactive (i.e., they should have sparse activations).

When the modern resurgence of deep learning began in 2006, feedforward
networks continued to have a bad reputation. From about 2006-2012, it was widely
believed that feedforward networks would not perform well unless they were assisted
by other models, such as probabilistic models. Today, it is known that with the
right resources and engineering practices, feedforward networks perform very well.
Today, gradient-based learning in feedforward networks is used as a tool to develop
probabilistic models, such as the variational autoencoder and generative adversarial
networks, described in Chapter 20. Rather than being viewed as an unreliable
technology that must be supported by other techniques, gradient-based learning in
feedforward networks has been viewed since 2012 as a powerful technology that
may be applied to many other machine learning tasks. In 2006, the community
used unsupervised learning to support supervised learning, and now, ironically, it
is more common to use supervised learning to support unsupervised learning.

Feedforward networks continue to have unfulfilled potential. In the future, we
expect they will be applied to many more tasks, and that advances in optimization
algorithms and model design will improve their performance even further. This
chapter has primarily described the neural network family of models. In the
subsequent chapters, we turn to how to use these models: how to regularize and
train them.
Chapter 7

Regularization for Deep Learning
A central problem in machine learning is how to make an algorithm that will
perform well not just on the training data, but also on new inputs. Many strategies
used in machine learning are explicitly designed to reduce the test error, possibly
at the expense of increased training error. These strategies are known collectively
as regularization. As we will see, there are a great many forms of regularization
available to the deep learning practitioner. In fact, developing more effective
regularization strategies has been one of the major research efforts in the field.

Chapter 5 introduced the basic concepts of generalization, underfitting,
overfitting, bias, variance and regularization. If you are not already familiar with
these notions, please refer to that chapter before continuing with this one.

In this chapter, we describe regularization in more detail, focusing on
regularization strategies for deep models or models that may be used as building
blocks to form deep models.

Some sections of this chapter deal with standard concepts in machine learning.
If you are already familiar with these concepts, feel free to skip the relevant
sections. However, most of this chapter is concerned with the extension of these
basic concepts to the particular case of neural networks.

In Sec. 5.2.2, we defined regularization as “any modification we make to a
learning algorithm that is intended to reduce its generalization error but not
its training error.” There are many regularization strategies. Some put extra
constraints on a machine learning model, such as adding restrictions on the
parameter values. Some add extra terms in the objective function that can be
thought of as corresponding to a soft constraint on the parameter values. If chosen
carefully, these extra constraints and penalties can lead to improved performance
on the test set. Sometimes these constraints and penalties are designed to encode
specific kinds of prior knowledge. Other times, these constraints and penalties
are designed to express a generic preference for a simpler model class in order to
promote generalization. Sometimes penalties and constraints are necessary to make
an underdetermined problem determined. Other forms of regularization, known as
ensemble methods, combine multiple hypotheses that explain the training data.

In the context of deep learning, most regularization strategies are based on
regularizing estimators. Regularization of an estimator works by trading increased
bias for reduced variance. An effective regularizer is one that makes a profitable
trade, reducing variance significantly while not overly increasing the bias. When
we discussed generalization and overfitting in Chapter 5, we focused on three
situations, where the model family being trained either (1) excluded the true
data generating process, corresponding to underfitting and inducing bias, (2)
matched the true data generating process, or (3) included the generating process
but also many other possible generating processes, the overfitting regime where
variance rather than bias dominates the estimation error. The goal of regularization
is to take a model from the third regime into the second regime.

In practice, an overly complex model family does not necessarily include the
target function or the true data generating process, or even a close approximation
of either. We almost never have access to the true data generating process so
we can never know for sure if the model family being estimated includes the
generating process or not. However, most applications of deep learning algorithms
are to domains where the true data generating process is almost certainly outside
the model family. Deep learning algorithms are typically applied to extremely
complicated domains such as images, audio sequences and text, for which the true
generation process essentially involves simulating the entire universe. To some
extent, we are always trying to fit a square peg (the data generating process) into
a round hole (our model family).

What this means is that controlling the complexity of the model is not a
simple matter of finding the model of the right size, with the right number of
parameters. Instead, we might find, and indeed in practical deep learning scenarios
we almost always do find, that the best fitting model (in the sense of minimizing
generalization error) is a large model that has been regularized appropriately.

We now review several strategies for how to create such a large, deep, regularized
model.
7.1 Parameter Norm Penalties

Regularization has been used for decades prior to the advent of deep learning. Linear
models such as linear regression and logistic regression allow simple, straightforward,
and effective regularization strategies.

Many regularization approaches are based on limiting the capacity of models,
such as neural networks, linear regression, or logistic regression, by adding a
parameter norm penalty Ω(θ) to the objective function J. We denote the regularized
objective function by J̃:

    J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)    (7.1)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of
the norm penalty term, Ω, relative to the standard objective function J(x; θ).
Setting α to 0 results in no regularization. Larger values of α correspond to more
regularization.

When our training algorithm minimizes the regularized objective function J̃ it
will decrease both the original objective J on the training data and some measure
of the size of the parameters θ (or some subset of the parameters). Different
choices for the parameter norm Ω can result in different solutions being preferred.
In this section, we discuss the effects of the various norms when used as penalties
on the model parameters.

Before delving into the regularization behavior of different norms, we note that
for neural networks, we typically choose to use a parameter norm penalty Ω that
penalizes only the weights of the affine transformation at each layer and leaves
the biases unregularized. The biases typically require less data to fit accurately
than the weights. Each weight specifies how two variables interact. Fitting the
weight well requires observing both variables in a variety of conditions. Each
bias controls only a single variable. This means that we do not induce too much
variance by leaving the biases unregularized. Also, regularizing the bias parameters
can introduce a significant amount of underfitting. We therefore use the vector w
to indicate all of the weights that should be affected by a norm penalty, while the
vector θ denotes all of the parameters, including both w and the unregularized
parameters.

In the context of neural networks, it is sometimes desirable to use a separate
penalty with a different α coefficient for each layer of the network. Because it can
be expensive to search for the correct value of multiple hyperparameters, it is still
reasonable to use the same weight decay at all layers just to reduce the search
space.
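As a minimal sketch of Eq. 7.1, the following code forms a regularized objective
from an ordinary training loss and a parameter norm penalty, applying the penalty
only to the weights and not to the bias, as recommended above. The quadratic
loss and all of the names used here (penalty, loss, regularized_loss) are
illustrative assumptions, not anything defined in this book.

    import numpy as np

    def penalty(w, norm="l2"):
        # Omega(theta): a parameter norm penalty applied to the weights only.
        return 0.5 * np.dot(w, w) if norm == "l2" else np.sum(np.abs(w))

    def loss(w, b, X, y):
        # J(theta; X, y): an ordinary training loss (here, mean squared error).
        r = X @ w + b - y
        return 0.5 * np.mean(r ** 2)

    def regularized_loss(w, b, X, y, alpha=0.1, norm="l2"):
        # Eq. 7.1: J~ = J + alpha * Omega, leaving the bias b unpenalized.
        return loss(w, b, X, y) + alpha * penalty(w, norm)

A gradient-based training algorithm would then simply descend on
regularized_loss instead of loss, and larger values of alpha would correspond
to more regularization.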

7.1.1 L² Parameter Regularization

We have already seen, in Sec. 5.2.2, one of the simplest and most common kinds
of parameter norm penalty: the L² parameter norm penalty commonly known as
weight decay. This regularization strategy drives the weights closer to the origin¹
by adding a regularization term Ω(θ) = ½‖w‖²₂ to the objective function. In
other academic communities, L² regularization is also known as ridge regression or
Tikhonov regularization.

We can gain some insight into the behavior of weight decay regularization
by studying the gradient of the regularized objective function. To simplify the
presentation, we assume no bias parameter, so θ is just w. Such a model has the
following total objective function:

    J̃(w; X, y) = (α/2) wᵀw + J(w; X, y),    (7.2)

with the corresponding parameter gradient

    ∇w J̃(w; X, y) = αw + ∇w J(w; X, y).    (7.3)

To take a single gradient step to update the weights, we perform this update:

    w ← w − ε(αw + ∇w J(w; X, y)).    (7.4)

Written another way, the update is:

    w ← (1 − εα)w − ε∇w J(w; X, y).    (7.5)

We can see that the addition of the weight decay term has modified the learning
rule to multiplicatively shrink the weight vector by a constant factor on each step,
just before performing the usual gradient update. This describes what happens in
a single step. But what happens over the entire course of training?

We will further simplify the analysis by making a quadratic approximation
to the objective function in the neighborhood of the value of the weights that
obtains minimal unregularized training cost, w* = arg min_w J(w). If the objective
function is truly quadratic, as in the case of fitting a linear regression model with
mean squared error, then the approximation is perfect:

    Ĵ(θ) = J(w*) + ½(w − w*)ᵀ H(w − w*)    (7.6)

¹More generally, we could regularize the parameters to be near any specific point in space
and, surprisingly, still get a regularization effect, but better results will be obtained for a value
closer to the true one, with zero being a default value that makes sense when we do not know if
the correct value should be positive or negative. Since it is far more common to regularize the
model parameters towards zero, we will focus on this special case in our exposition.
where H is the Hessian matrix of J with respect to w evaluated at w*. There is
no first-order term in this quadratic approximation, because w* is defined to be a
minimum, where the gradient vanishes. Likewise, because w* is the location of a
minimum of J, we can conclude that H is positive semidefinite.

The minimum of Ĵ occurs where its gradient

    ∇w Ĵ(w) = H(w − w*)    (7.7)

is equal to 0.

To study the effect of weight decay, we modify Eq. 7.7 by adding the weight
decay gradient. We can now solve for the minimum of the regularized version of Ĵ.
We use the variable w̃ to represent the location of the minimum.

    αw̃ + H(w̃ − w*) = 0    (7.8)
    (H + αI)w̃ = Hw*    (7.9)
    w̃ = (H + αI)⁻¹Hw*.    (7.10)

As α approaches 0, the regularized solution w̃ approaches w*. But what
happens as α grows? Because H is real and symmetric, we can decompose it
into a diagonal matrix Λ and an orthonormal basis of eigenvectors, Q, such that
H = QΛQᵀ. Applying the decomposition to Eq. 7.10, we obtain:

    w̃ = (QΛQᵀ + αI)⁻¹QΛQᵀw*    (7.11)
       = [Q(Λ + αI)Qᵀ]⁻¹QΛQᵀw*    (7.12)
       = Q(Λ + αI)⁻¹ΛQᵀw*.    (7.13)

We see that the effect of weight decay is to rescale w* along the axes defined by
the eigenvectors of H. Specifically, the component of w* that is aligned with the
i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α). (You may wish to review
how this kind of scaling works, first explained in Fig. 2.3).

Along the directions where the eigenvalues of H are relatively large, for example,
where λᵢ ≫ α, the effect of regularization is relatively small. However, components
with λᵢ ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated
in Fig. 7.1.

Only directions along which the parameters contribute significantly to reducing
the objective function are preserved relatively intact. In directions that do not
contribute to reducing the objective function, a small eigenvalue of the Hessian
tells us that movement in this direction will not significantly increase the gradient.
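The rescaling in Eq. 7.13 can be checked numerically. Below is a small sketch
comparing the directly solved regularized minimum of Eq. 7.10 against the
eigenvector-by-eigenvector shrinkage factors λᵢ/(λᵢ + α); the Hessian and w* here
are arbitrary made-up values, used only for illustration.

    import numpy as np

    # An arbitrary symmetric positive definite "Hessian" and unregularized optimum.
    H = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
    w_star = np.array([1.0, -2.0])
    alpha = 0.5

    # Eq. 7.10: direct solve for the regularized minimum.
    w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

    # Eq. 7.13: rescale the components of w* in the eigenbasis of H.
    lam, Q = np.linalg.eigh(H)                  # H = Q diag(lam) Q^T
    w_tilde_eig = Q @ (lam / (lam + alpha) * (Q.T @ w_star))

    print(np.allclose(w_tilde, w_tilde_eig))    # True: the two forms agree

Components along eigenvectors with λᵢ ≫ α pass through almost unchanged, while
those with λᵢ ≪ α are shrunk toward zero, exactly as Fig. 7.1 illustrates
geometrically.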

Figure 7.1: An illustration of the effect of L² (or weight decay) regularization on the value
of the optimal w. The solid ellipses represent contours of equal value of the unregularized
objective. The dotted circles represent contours of equal value of the L² regularizer. At
the point w̃, these competing objectives reach an equilibrium. In the first dimension, the
eigenvalue of the Hessian of J is small. The objective function does not increase much
when moving horizontally away from w*. Because the objective function does not express
a strong preference along this direction, the regularizer has a strong effect on this axis.
The regularizer pulls w₁ close to zero. In the second dimension, the objective function
is very sensitive to movements away from w*. The corresponding eigenvalue is large,
indicating high curvature. As a result, weight decay affects the position of w₂ relatively
little.

Components of the weight vector corresponding to such unimportant directions
are decayed away through the use of the regularization throughout training.

So far we have discussed weight decay in terms of its effect on the optimization
of an abstract, general, quadratic cost function. How do these effects relate to
machine learning in particular? We can find out by studying linear regression, a
model for which the true cost function is quadratic and therefore amenable to the
same kind of analysis we have used so far. Applying the analysis again, we will
be able to obtain a special case of the same results, but with the solution now
phrased in terms of the training data. For linear regression, the cost function is
the sum of squared errors:

    (Xw − y)ᵀ(Xw − y).    (7.14)

When we add L² regularization, the objective function changes to

    (Xw − y)ᵀ(Xw − y) + ½αwᵀw.    (7.15)

This changes the normal equations for the solution from

    w = (XᵀX)⁻¹Xᵀy    (7.16)

to

    w = (XᵀX + αI)⁻¹Xᵀy.    (7.17)

The matrix XᵀX in Eq. 7.16 is proportional to the covariance matrix (1/m)XᵀX.
Using L² regularization replaces this matrix with (XᵀX + αI)⁻¹ in Eq. 7.17.
The new matrix is the same as the original one, but with the addition of α to the
diagonal. The diagonal entries of this matrix correspond to the variance of each
input feature. We can see that L² regularization causes the learning algorithm
to “perceive” the input X as having higher variance, which makes it shrink the
weights on features whose covariance with the output target is low compared to
this added variance.
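The closed-form solutions in Eqs. 7.16 and 7.17 are easy to compare numerically.
The sketch below fits both on random data; the data and the value of α are made
up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))            # design matrix, m = 100 examples
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
    alpha = 10.0

    # Eq. 7.16: ordinary least squares.
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Eq. 7.17: L2-regularized solution, adding alpha to the diagonal.
    w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

    print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))  # ridge norm is smaller

Because α is added to the diagonal of XᵀX, the regularized weights are shrunk
relative to the ordinary least squares solution.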
7.1.2 L¹ Regularization

While L² weight decay is the most common form of weight decay, there are other
ways to penalize the size of the model parameters. Another option is to use L¹
regularization.

Formally, L¹ regularization on the model parameter w is defined as:

    Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|,    (7.18)
that is, as the sum of absolute values of the individual parameters.2 We will
no
noww discuss the effect of L1 regularization on the simple linear regression mo model,
del,
that is, as the sum of absolute v alues of the individual parameters.
with no bias parameter, that we studied in our analysis of L regularization. In2 W e will
now discusswethe
particular, areeffect
in of L in
interested
terested regularization
delineating the on differences
the simpleblinearet
etwweenregression
L1 and L2mo del,
forms
with no bias parameter, that
of regularization. As with L 2 weigh w e studied
eightt deca in our
y, L1 weigh
decay analysis
eightt deca
decayof L
y conregularization.
controls
trols the strengthIn
particular,
of we are interested
the regularization by scalingin delineating
the penalty theΩdifferences between
using a positive hypL erparameter
and L forms
yperparameter α.
of
Th regularization.
Thus, As with
us, the regularized ob L
objective w eigh t decay
˜ , L w eigh t deca
jective function J (w; X , y) is given by y con trols the strength
of the regularization by scaling the penalty Ω using a positive hyperparameter α.
Thus, the regularized ob J˜ (jective ) = α||wJ˜||(1w+
w; X , yfunction ;XJ (, w is ,given
y); X y), by (7.19)

with the correspondingJ˜gradient


(w; X , y(actually + J (w; X ,t):
) = α w, sub-gradien
(actually, sub-gradient):y), (7.19)
|| ||
∇w J˜(gradient
with the corresponding w; X , y)(actually
= αsign(, w
sub-gradien
) + ∇w J (Xt):, y; w) (7.20)

where sign(w) is simply J˜(w ; Xsign


the , y) of
=α wsign( w) +elemen
applied J (t-wise.
X , y; w)
element-wise. (7.20)
By sign(
insp
inspecting
ecting ∇
Eq. 7.20 ∇ that the effect of L 1 regu-
where w ) is simply the, sign
we can of wsee immediately
applied elemen t-wise.
differentt from that of L2 regularization. Sp
larization is quite differen Specifically
ecifically
ecifically,, we can
By insp ecting Eq. 7.20 , w e can see immediately
see that the regularization contribution to the gradient no longer that the effect of Llinearly
scales regu-
larization
with each is
w i;quite differen
instead it is tafrom
constant of L regularization.
that factor with a sign equal Specifically
sign((w, i)we
to sign can
. One
see that the regularization
consequence of this form ofcontribution gradientt to
the gradien is the
thatgradient no longer
we will not scalessee
necessarily linearly
clean
with each w ; instead it is a
algebraic solutions to quadratic appro constant factor
ximations of J (X , y; w) as we did forOne
approximations with a sign equal to sign (w ). L2
consequence
regularization. of this form of the gradien t is that we will not necessarily see clean
algebraic solutions to quadratic approximations of J (X , y; w) as we did for L
Our simple linear mo modeldel has a quadratic cost function that we can represen representt
regularization.
via its Taylor series. Alternately
Alternately,, we could imagine that this is a truncated Taylor
Our simple
series appro
approximatinglinear
ximating the mocost
del has a quadratic
function of a more cost function that
sophisticated mo we can
model.
del. Therepresen
gradientt
viathis
in its Taylor series.
setting is giv
givenAlternately
en by , we could imagine that this is a truncated Taylor
series approximating the cost function of a more sophisticated model. The gradient
in this setting is given by ∇w Jˆ(w) = H (w − w∗), (7.21)

where, again, H is the Hessian J ˆ(w) =ofHJ(w


matrix withwresp ), ect to w ev
respect w∗ .
aluated at(7.21)
evaluated
Because the L1 penalt
enalty ∇es not admit clean
y do
does − algebraic expressions in the case
where, again, H is the Hessian matrix of J with respect to w evaluated at w .
of a fully general Hessian, we will also mak makee the further simplifying assumption
thatBecause the L ispenalt y doesHnot
= admit
diag
diag([
([
([H
Hclean algebraic ]),,expressions hinHthe case
the Hessian diagonal, 1,1 , . . . , Hn,n ]) where eac
each i,i > 0.
of a fully
This general Hessian,
assumption we will
holds if the dataalso
for mak
the elinear
the further simplifying
regression problem assumption
has been
that
preprothe Hessian
preprocessed
cessed to remois diagonal,
remove H = diag ([
ve all correlation betw H
etween , . . . , H ]) , where
een the input features, whic eac h
which hHma
may >
y b0e.
This assumption holds
accomplished using PCA. if the data for the linear regression problem has b een
preprocessed to remove all correlation between the input features, which may be
2
As with L 2 regularization,
accomplished using PCA. we could regularize(othe parameters towards a value that is not
) 1
zero, but instead towards some parameter value
P w . In that case the L regularization would
(o)
introduce the term Ω(θ ) = ||w − w(o) || 1 = i |wi − w i |.

235
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING

Our quadratic approximation of the L1 regularized objective function decomposes into a sum over the parameters:

Ĵ(w; X, y) = J(w∗; X, y) + Σᵢ [ ½ Hᵢ,ᵢ (wᵢ − w∗ᵢ)² + α|wᵢ| ].   (7.22)

The problem of minimizing this approximate cost function has an analytical solution (for each dimension i), with the following form:

wᵢ = sign(w∗ᵢ) max( |w∗ᵢ| − α/Hᵢ,ᵢ , 0 ).   (7.23)
Consider the situation where w∗ᵢ > 0 for all i. There are two possible outcomes:

1. w∗ᵢ ≤ α/Hᵢ,ᵢ. Here the optimal value of wᵢ under the regularized objective is simply wᵢ = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed, in direction i, by the L1 regularization, which pushes the value of wᵢ to zero.

2. w∗ᵢ > α/Hᵢ,ᵢ. Here the regularization does not move the optimal value of wᵢ to zero, but instead just shifts it in that direction by a distance equal to α/Hᵢ,ᵢ.

A similar process happens when w∗ᵢ < 0, but with the L1 penalty making wᵢ less negative by α/Hᵢ,ᵢ, or 0.
In comparison to L2 regularization, L1 regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero. The sparsity of L1 regularization is a qualitatively different behavior than arises with L2 regularization. Eq. 7.13 gave the solution w̃ for L2 regularization. If we revisit that equation using the assumption of a diagonal Hessian H that we introduced for our analysis of L1 regularization, we find that w̃ᵢ = (Hᵢ,ᵢ / (Hᵢ,ᵢ + α)) w∗ᵢ. If w∗ᵢ was nonzero, then w̃ᵢ remains nonzero. This demonstrates that L2 regularization does not cause the parameters to become sparse, while L1 regularization may do so for large enough α.
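To make the contrast concrete, the following minimal numpy sketch evaluates both per-dimension solutions on made-up values of w∗ and the diagonal Hessian; the L1 update of Eq. 7.23 zeroes the small coordinate, while the L2 shrinkage leaves every coordinate nonzero:

```python
import numpy as np

# Illustrative values (assumptions, not from any real model).
w_star = np.array([0.05, -0.4, 1.3])   # unregularized optima w*_i
H_diag = np.array([1.0, 2.0, 0.5])     # diagonal Hessian entries H_{i,i} > 0
alpha = 0.2

# L1 (Eq. 7.23): soft-thresholding; coordinates with |w*_i| <= alpha/H_{i,i}
# are set exactly to zero.
w_l1 = np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

# L2 (Eq. 7.13 under a diagonal Hessian): multiplicative shrinkage;
# a nonzero w*_i always stays nonzero.
w_l2 = (H_diag / (H_diag + alpha)) * w_star

print(w_l1)  # [ 0.     -0.3     0.9   ]  -> first coordinate pruned to zero
print(w_l2)  # [ 0.0417 -0.3636  0.9286]  -> all coordinates merely shrunk
```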
The sparsity property induced by L1 regularization has been used extensively as a feature selection mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used. In particular, the well known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an L1 penalty with a linear model and a least squares cost function. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
In Sec. 5.6.1, we saw that many regularization strategies can be interpreted as MAP Bayesian inference, and that in particular, L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. For L1 regularization, the penalty αΩ(w) = α Σᵢ |wᵢ| used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution (Eq. 3.26) over w:

log p(w) = Σᵢ log Laplace(wᵢ; 0, 1/α) = −α Σᵢ |wᵢ| + n log α − n log 2.   (7.24)

From the point of view of learning via minimization with respect to w, we can ignore the log α − log 2 terms because they do not depend on w.
7.2 Norm Penalties as Constrained Optimization

Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).   (7.25)
Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).   (7.26)

The solution to the constrained problem is given by

θ∗ = arg min_θ max_{α, α≥0} L(θ, α).   (7.27)
As described in Sec. 4.4, solving this problem requires modifying both θ and α. Sec. 4.5 provides a worked example of linear regression with an L2 constraint. Many different procedures are possible (some may use gradient descent, while others may use analytical solutions for where the gradient is zero), but in all procedures α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. All positive α encourage Ω(θ) to shrink. The optimal value α∗ will encourage Ω(θ) to shrink, but not so strongly to make Ω(θ) become less than k.
To gain some insight into the effect of the constraint, we can fix α∗ and view the problem as just a function of θ:

θ∗ = arg min_θ L(θ, α∗) = arg min_θ J(θ; X, y) + α∗Ω(θ).   (7.28)
This is exactly the same as the regularized training problem of minimizing J̃. We can thus think of a parameter norm penalty as imposing a constraint on the weights. If Ω is the L2 norm, then the weights are constrained to lie in an L2 ball. If Ω is the L1 norm, then the weights are constrained to lie in a region of limited L1 norm. Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α∗ because the value of α∗ does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α∗ depends on the form of J. While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region. Larger α will result in a smaller constraint region. Smaller α will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As described in Sec. 4.4, we can modify algorithms such as stochastic gradient descent to take a step downhill on J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k is appropriate and do not want to spend time searching for the value of α that corresponds to this k.

Another reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause non-convex optimization procedures to get stuck in local minima corresponding to small θ. When training neural networks, this usually manifests as neural networks that train with several "dead units." These are units that do not contribute much to the behavior of the function learned by the network because the weights going into or out of them are all very small. When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce J by making the weights larger. Explicit constraints implemented by re-projection can work much better in these cases because they do not encourage the weights to approach the origin. Explicit constraints implemented by re-projection only have an effect when the weights become large and attempt to leave the constraint region.

Finally, explicit constraints with reprojection can be useful because they impose some stability on the optimization procedure. When using high learning rates, it is possible to encounter a positive feedback loop in which large weights induce large gradients which then induce a large update to the weights. If these updates
consistently increase the size of the weights, then θ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection allow us to terminate this feedback loop after the weights have reached a certain magnitude. Hinton et al. (2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of parameter space while maintaining some stability.
In particular, Hinton et al. (2012c) recommend constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix, a strategy introduced by Srebro and Shraibman (2005). Constraining the norm of each column separately prevents any one hidden unit from having very large weights. If we converted this constraint into a penalty in a Lagrange function, it would be similar to L2 weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
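A reprojection step of this kind is straightforward to write down. The sketch below is a minimal numpy version (the function name and the training-loop fragment are hypothetical, and k is assumed given); it rescales any column whose norm exceeds k, leaving columns inside the constraint region untouched:

```python
import numpy as np

def project_columns(W, k):
    """Rescale each column of W whose L2 norm exceeds k back onto the
    norm ball of radius k; columns already within the ball are unchanged."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, k / np.maximum(norms, 1e-12))

# Hypothetical use inside a training loop:
#   W -= learning_rate * grad_W        # unconstrained step downhill on J
#   W = project_columns(W, k=3.0)      # then reproject onto the constraint region
```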
7.3 Regularization and Under-Constrained Problems

In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix XᵀX. This is not possible whenever XᵀX is singular. This matrix can be singular whenever the data truly has no variance in some direction, or when there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting XᵀX + αI instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix is invertible. It is also possible for a problem with no closed form solution to be underdetermined. An example is logistic regression applied to a problem where the classes are linearly separable. If a weight vector w is able to achieve perfect classification, then 2w will also achieve perfect classification and higher likelihood. An iterative optimization procedure like stochastic gradient descent will continually increase the magnitude of w and, in theory, will never halt. In practice, a numerical implementation of gradient descent will eventually reach sufficiently large weights to cause numerical overflow, at which point its behavior will depend on how the programmer has decided to handle values that are not real numbers.

Most forms of regularization are able to guarantee the convergence of iterative
methods applied to underdetermined problems. For example, weight decay will cause gradient descent to quit increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient.

The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems.
As we saw in Sec. 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. Recall that one definition of the pseudoinverse X⁺ of a matrix X is

X⁺ = lim_{α→0⁺} (XᵀX + αI)⁻¹ Xᵀ.   (7.29)

We can now recognize Eq. 7.29 as performing linear regression with weight decay. Specifically, Eq. 7.29 is the limit of Eq. 7.17 as the regularization coefficient shrinks to zero. We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
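The limit in Eq. 7.29 is easy to check numerically. A minimal sketch, assuming a random underdetermined X, compares the weight-decay matrix to numpy's built-in pseudoinverse as α shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # fewer examples than features: X^T X is singular

def ridge_matrix(X, alpha):
    """Weight-decay (ridge) regression matrix (X^T X + alpha*I)^{-1} X^T."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T)

for alpha in [1.0, 1e-3, 1e-8]:
    gap = np.abs(ridge_matrix(X, alpha) - np.linalg.pinv(X)).max()
    print(alpha, gap)   # the gap shrinks as alpha -> 0, illustrating Eq. 7.29
```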
7.4 Dataset Augmentation

The best way to make a machine learning model generalize better is to train it on more data. Of course, in practice, the amount of data we have is limited. One way to get around this problem is to create fake data and add it to the training set. For some machine learning tasks, it is reasonably straightforward to create new fake data.
This approach is easiest for classification. A classifier needs to take a complicated, high dimensional input x and summarize it with a single category identity y. This means that the main task facing a classifier is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by transforming the x inputs in our training set.
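For image inputs, one simple label-preserving transformation is a small translation. The following sketch (a bare-bones zero-padded shift; a real pipeline would offer many more operations) generates new (x, y) pairs from a single labeled image:

```python
import numpy as np

def shift_image(img, dx, dy):
    """Translate a 2-D image by (dx, dy) pixels, zero-padding the border."""
    out = np.zeros_like(img)
    h, w = img.shape
    xs = slice(max(dx, 0), min(w + dx, w))       # destination columns
    ys = slice(max(dy, 0), min(h + dy, h))       # destination rows
    xs_src = slice(max(-dx, 0), min(w - dx, w))  # source columns
    ys_src = slice(max(-dy, 0), min(h - dy, h))  # source rows
    out[ys, xs] = img[ys_src, xs_src]
    return out

# Each shifted copy keeps the original label y:
# augmented = [(shift_image(x, dx, dy), y) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
```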
This approach is not as readily applicable to many other tasks. For example, it is difficult to generate new fake data for a density estimation task unless we have already solved the density estimation problem.

Dataset augmentation has been a particularly effective technique for a specific classification problem: object recognition. Images are high dimensional and include an enormous variety of factors of variation, many of which can be easily simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, even if the model has already been designed to be partially translation invariant by using the convolution and pooling techniques
described in Chapter 9. Many other operations such as rotating the image or scaling the image have also proven quite effective.

One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between 'b' and 'd' and the difference between '6' and '9', so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.
There are also transformations that we would like our classifiers to be invariant to, but which are not easy to perform. For example, out-of-plane rotation cannot be implemented as a simple geometric operation on the input pixels.

Dataset augmentation is effective for speech recognition tasks as well (Jaitly and Hinton, 2013).
Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a form of data augmentation. For many classification and even some regression tasks, the task should still be possible to solve even if small random noise is added to the input. Neural networks prove not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms such as the denoising autoencoder (Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction. Poole et al. (2014) recently showed that this approach can be highly effective provided that the magnitude of the noise is carefully tuned. Dropout, a powerful regularization strategy that will be described in Sec. 7.12, can be seen as a process of constructing new inputs by multiplying by noise.
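As a sketch, training with input noise injection can be as simple as perturbing each minibatch before the usual gradient step; the helper names below are hypothetical and the noise magnitude sigma must be tuned:

```python
import numpy as np

def noisy_minibatch(X, sigma, rng):
    """Add small isotropic Gaussian noise to each input, as a simple form
    of dataset augmentation / robustness training."""
    return X + sigma * rng.normal(size=X.shape)

# Hypothetical training-loop fragment:
# rng = np.random.default_rng(0)
# for X_batch, y_batch in batches:
#     X_tilde = noisy_minibatch(X_batch, sigma=0.1, rng=rng)
#     loss, grads = model_loss_and_grads(X_tilde, y_batch)  # hypothetical helpers
#     apply_update(grads)
```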
When comparing machine learning benchmark results, it is important to take the effect of dataset augmentation into account. Often, hand-designed dataset augmentation schemes can dramatically reduce the generalization error of a machine learning technique. To compare the performance of one machine learning algorithm to another, it is necessary to perform controlled experiments. When comparing machine learning algorithm A and machine learning algorithm B, it is necessary to make sure that both algorithms were evaluated using the same hand-designed dataset augmentation schemes. Suppose that algorithm A performs poorly with no dataset augmentation and algorithm B performs well when combined with numerous synthetic transformations of the input. In such a case it is likely the synthetic transformations caused the improved performance, rather than the use of machine learning algorithm B. Sometimes deciding whether an experiment has been properly controlled requires subjective judgment. For example, machine learning algorithms that inject noise into the input are performing a form of dataset augmentation. Usually, operations that are generally applicable (such as adding Gaussian noise to the input) are considered part of the machine learning algorithm, while operations that are specific to one application domain (such as randomly cropping an image) are considered to be separate pre-processing steps.
7.5 Noise Robustness

Sec. 7.4 has motivated the use of noise applied to the inputs as a dataset augmentation strategy. For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b). In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. Noise applied to the hidden units is such an important topic as to merit its own separate discussion; the dropout algorithm described in Sec. 7.12 is the main development of that approach.
Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). This can be interpreted as a stochastic implementation of Bayesian inference over the weights. The Bayesian treatment of learning would consider the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way to reflect this uncertainty (Graves, 2011).
This can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization. Adding noise to the weights has been shown to be an effective regularization strategy in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). In the following, we will present an analysis of the effect of weight noise on a standard feedforward neural network (as introduced in Chapter 6).

We study the regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar using the least-squares cost function between the model predictions ŷ(x) and the true values y:

J = E_{p(x,y)} [ (ŷ(x) − y)² ].   (7.30)

The training set consists of m labeled examples {(x^(1), y^(1)), . . . , (x^(m), y^(m))}.
We now assume that with each input presentation we also include a random perturbation ε_W ∼ N(ε; 0, ηI) of the network weights. Let us imagine that we have a standard l-layer MLP. We denote the perturbed model as ŷ_{ε_W}(x). Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes:

J̃_W = E_{p(x,y,ε_W)} [ (ŷ_{ε_W}(x) − y)² ]   (7.31)
    = E_{p(x,y,ε_W)} [ ŷ_{ε_W}(x)² − 2y ŷ_{ε_W}(x) + y² ].   (7.32)

For small η, the minimization of J with added weight noise (with covariance ηI) is equivalent to minimization of J with an additional regularization term: η E_{p(x,y)} [ ‖∇_W ŷ(x)‖² ]. This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output. In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions (Hochreiter and Schmidhuber, 1995). In the simplified case of linear regression (where, for instance, ŷ(x) = wᵀx + b), this regularization term collapses into η E_{p(x)} [ ‖x‖² ], which is not a function of parameters and therefore does not contribute to the gradient of J̃_W with respect to the model parameters.
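A minimal sketch of this setup for the linear case, assuming a weight vector w and per-presentation noise of variance η, is shown below:

```python
import numpy as np

def perturbed_prediction(x, w, b, eta, rng):
    """Evaluate a linear model with weight noise eps_W ~ N(0, eta*I),
    drawn afresh for each presentation of an input (cf. Eq. 7.31)."""
    eps = np.sqrt(eta) * rng.normal(size=w.shape)
    return np.dot(w + eps, x) + b

# For this linear model, grad_w y_hat(x) = x, so the induced regularizer
# eta * E[||grad_w y_hat(x)||^2] = eta * E[||x||^2] does not depend on w,
# matching the collapse noted above. For a deeper network, the same noise
# instead penalizes sensitivity of the output to the weights.
```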
7.5.1 Injecting Noise at the Output Targets

Most datasets have some amount of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge: the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since
the 1980s and continues to be featured prominently in modern neural networks (Szegedy et al., 2015).
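The soft targets are easy to construct. A minimal numpy sketch, assuming integer class labels and the ε/(k − 1) scheme described above:

```python
import numpy as np

def smooth_targets(y, k, eps):
    """Replace hard one-hot targets with eps/(k-1) off-target and
    1-eps on-target probabilities, as in label smoothing."""
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

y = np.array([0, 2, 1])
print(smooth_targets(y, k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]
#  [0.05 0.9  0.05]]
```

The smoothed rows still sum to one, so the standard cross-entropy loss can be applied to them unchanged.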
7.6 Semi-Supervised Learning

In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x, y) are used to estimate P(y | x) or predict y from x.
In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x). The goal is to learn a representation so that examples from the same class have similar representations. Unsupervised learning can provide useful cues for how to group examples in representation space. Examples that cluster tightly in the input space should be mapped to similar representations. A linear classifier in the new space may achieve better generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the application of principal components analysis as a pre-processing step before applying a classifier (on the projected data).
Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y | x). One can then trade off the supervised criterion −log P(y | x) with the unsupervised or generative one (such as −log P(x) or −log P(x, y)). The generative criterion then expresses a particular form of prior belief about the solution to the supervised learning problem (Lasserre et al., 2006), namely that the structure of P(x) is connected to the structure of P(y | x) in a way that is captured by the shared parametrization. By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and Bengio, 2008).
In the context of scarcity of labeled data (and abundance of unlabeled data), deep architectures have shown promise as well. Salakhutdinov and Hinton (2008) describe a method for learning the kernel function of a kernel machine used for regression, in which the usage of unlabeled examples for modeling P(x) improves P(y | x) quite significantly.

See Chapelle et al. (2006) for more information about semi-supervised learning.
7.7 Multi-Task Learning

Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization.
[Figure 7.2: a diagram of a deep network with task outputs y^(1) and y^(2) at the top, task-specific hidden units h^(1), h^(2) and an additional factor h^(3) above a shared representation h^(shared), which sits on top of the input x.]
Figure 7.2: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from h^(1) and h^(2)) can be learned on top of those, yielding a shared representation h^(shared). The underlying assumption is that there exists a common pool of factors that explain the variations in the input x, while each task is associated with a subset of these factors. In this example, it is additionally assumed that top-level hidden units h^(1) and h^(2) are specialized to each task (respectively predicting y^(1) and y^(2)) while some intermediate-level representation h^(shared) is shared across all tasks. In the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h^(3)): these are the factors that explain some of the input variations but are not relevant for predicting y^(1) or y^(2).

Fig. 7.2 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting y^(i) given x) share the same input x, as well as some intermediate-level representation h^(shared) capturing a common pool of factors. The
model can generally be divided into two kinds of parts and associated parameters:

1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in Fig. 7.2.

2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in Fig. 7.2.

Improved generalization and generalization error bounds (Baxter, 1995) can be achieved because of the shared parameters, for which statistical strength can be greatly improved (in proportion with the increased number of examples for the shared parameters, compared to the scenario of single-task models). Of course this will happen only if some assumptions about the statistical relationship between the different tasks are valid, meaning that there is something shared across some of the tasks.

From the point of view of deep learning, the underlying prior belief is the following: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
7.8 Early Stopping

When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again. See Fig. 7.3 for an example of this behavior. This behavior occurs very reliably.
This means we can obtain a model with better validation set error (and thus, hopefully better test set error) by returning to the parameter setting at the point in time with the lowest validation set error. Instead of running our optimization algorithm until we reach a (local) minimum of validation error, we run it until the error on the validation set has not improved for some amount of time. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters. This procedure is specified more formally in Algorithm 7.1.
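The following is a minimal sketch of this procedure (not the book's Algorithm 7.1 verbatim). It assumes a hypothetical model object exposing a params attribute, plus train_step and validation_error callables supplied by the caller; patience plays the role of the "amount of time" without improvement.

```python
import copy

def early_stopping_train(model, train_step, validation_error, patience):
    """Train until validation error has not improved for `patience` checks.

    Returns the parameters (and step count) that achieved the best
    validation error, not the most recent ones.
    """
    best_error = float("inf")
    best_params = copy.deepcopy(model.params)
    best_step = 0
    step = 0
    steps_without_improvement = 0
    while steps_without_improvement < patience:
        train_step(model)                       # one optimization step (e.g. SGD)
        step += 1
        error = validation_error(model)
        if error < best_error:                  # store a copy on every improvement
            best_error = error
            best_params = copy.deepcopy(model.params)
            best_step = step
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
    return best_params, best_step, best_error
```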
This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training iterations over the dataset, or epochs). In this example, we train a maxout network on MNIST. Observe that the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter. We can see in Fig. 7.3 that this hyperparameter has a U-shaped validation set performance curve. Most hyperparameters that control model capacity have such a U-shaped validation set performance curve, as illustrated in Fig. 5.3. In the case of early stopping, we are controlling the effective capacity of the model by determining how many steps it can take to fit the training set. Most hyperparameters must be chosen using an expensive guess and check process, where we set a hyperparameter at the start of training, then run training for several steps to see its effect. The "training time" hyperparameter is unique in that by definition a single run of training tries out many values of the hyperparameter. The only significant cost to choosing this hyperparameter automatically via early stopping is running the validation set evaluation periodically during training. Ideally, this is done in parallel to the training process on a separate machine, separate CPU, or separate GPU from the main training process. If such resources are not available, then the cost of these periodic evaluations may be reduced by using a validation set that is small compared to the training set or by evaluating the validation set error less frequently and obtaining a lower resolution estimate of the optimal training time.
An additional cost to early stopping is the need to maintain a copy of the best parameters. This cost is generally negligible, because it is acceptable to store these parameters in a slower and larger form of memory (for example, training in GPU memory, but storing the optimal parameters in host memory or on a disk drive). Since the best parameters are written to infrequently and never read during training, these occasional slow writes have little effect on the total training time.
Early stopping is a very unobtrusive form of regularization, in that it requires almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.

Early stopping may be used either alone or in conjunction with other regularization strategies. Even when using regularization strategies that modify the objective function to encourage better generalization, it is rare for the best generalization to occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not fed to the model. To best exploit this extra data, one can perform extra training after the initial training with early stopping has completed. In the second, extra training step, all of the training data is included. There are two basic strategies one can use for this second training procedure.
One strategy (Algorithm 7.2) is to initialize the model again and retrain on all of the data. In this second training pass, we train for the same number of steps as the early stopping procedure determined was optimal in the first pass. There are some subtleties associated with this procedure. For example, there is not a good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset. On the second round of training, each pass through the dataset will require more parameter updates because the training set is bigger.
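A sketch of this first strategy, reusing the hypothetical helper interface from the earlier early stopping sketch; here the step count is interpreted as parameter updates, which is exactly the ambiguity noted above.

```python
def retrain_on_all_data(make_model, train_step_all_data, best_step):
    """Strategy 1: reinitialize and train on the full dataset for the
    number of steps that early stopping found optimal on the subset."""
    model = make_model()                        # fresh random initialization
    for _ in range(best_step):                  # same number of steps as pass 1
        train_step_all_data(model)              # now including validation data
    return model
```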
Another strategy for using all of the data is to keep the parameters obtained from the first round of training and then continue training but now using all of the data. At this stage, we now no longer have a guide for when to stop in terms of a number of steps. Instead, we can monitor the average loss function on the validation set, and continue training until it falls below the value of the training set objective at which the early stopping procedure halted. This strategy avoids the high cost of retraining the model from scratch, but is not as well-behaved. For example, there is not any guarantee that the objective on the validation set will ever reach the target value, so this strategy is not even guaranteed to terminate. This procedure is presented more formally in Algorithm 7.3.
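A sketch of this second strategy, again with hypothetical helpers; because the loop is not guaranteed to terminate, this version adds a max_steps safeguard that is not part of the original description.

```python
def continue_training(model, train_step_all_data, validation_loss,
                      stopping_train_objective, max_steps=100000):
    """Strategy 2: keep the parameters from the first round and keep
    training on all data until the validation loss falls below the
    training objective at which early stopping halted."""
    for _ in range(max_steps):                  # safeguard: may never converge
        if validation_loss(model) <= stopping_train_objective:
            break
        train_step_all_data(model)
    return model
```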
Early stopping is also useful because it reduces the computational cost of the training procedure. Besides the obvious reduction in cost due to limiting the number of training iterations, it also has the benefit of providing regularization without requiring the addition of penalty terms to the cost function or the computation of the gradients of such additional terms.


Figure 7.4: An illustration of the effect of early stopping. (Left) The solid contour lines indicate the contours of the negative log-likelihood. The dashed line indicates the trajectory taken by SGD beginning from the origin. Rather than stopping at the point w∗ that minimizes the cost, early stopping results in the trajectory stopping at an earlier point w̃. (Right) An illustration of the effect of L2 regularization for comparison. The dashed circles indicate the contours of the L2 penalty, which causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularized cost.
How early stopping acts as a regularizer: So far we have stated that early stopping is a regularization strategy, but we have supported this claim only by showing learning curves where the validation set error has a U-shaped curve. What is the actual mechanism by which early stopping regularizes the model? Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ_o. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate ε. We can view the product ετ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_o. In this sense, ετ behaves as if it were the reciprocal of the coefficient used for weight decay.
Indeed, we can show how—in the case of a simple linear model with a quadratic error function and simple gradient descent—early stopping is equivalent to L2 regularization.
In order to compare with classical L2 regularization, we examine a simple setting where the only parameters are linear weights (θ = w). We can model the cost function J with a quadratic approximation in the neighborhood of the empirically optimal value of the weights w∗:

    Ĵ(θ) = J(w∗) + (1/2)(w − w∗)⊤ H (w − w∗),    (7.33)

where H is the Hessian matrix of J with respect to w evaluated at w∗. Given the assumption that w∗ is a minimum of J(w), we know that H is positive semidefinite. Under a local Taylor series approximation, the gradient is given by:

    ∇_w Ĵ(w) = H(w − w∗).    (7.34)

We are going to study the trajectory followed by the parameter vector during training. For simplicity, let us set the initial parameter vector to the origin,³ that is w^(0) = 0. Let us suppose that we update the parameters via gradient descent:

    w^(τ) = w^(τ−1) − ε∇_w J(w^(τ−1))    (7.35)
          = w^(τ−1) − εH(w^(τ−1) − w∗)    (7.36)
    w^(τ) − w∗ = (I − εH)(w^(τ−1) − w∗)    (7.37)

Let us now rewrite this expression in the space of the eigenvectors of H, exploiting the eigendecomposition of H: H = QΛQ⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.

    w^(τ) − w∗ = (I − εQΛQ⊤)(w^(τ−1) − w∗)    (7.38)
    Q⊤(w^(τ) − w∗) = (I − εΛ)Q⊤(w^(τ−1) − w∗)    (7.39)

Assuming that w^(0) = 0 and that ε is chosen to be small enough to guarantee |1 − ελ_i| < 1, the parameter trajectory during training after τ parameter updates is as follows:

    Q⊤w^(τ) = [I − (I − εΛ)^τ] Q⊤w∗.    (7.40)

Now, the expression for Q⊤w̃ in Eq. 7.13 for L2 regularization can be rearranged as:

    Q⊤w̃ = (Λ + αI)⁻¹ Λ Q⊤w∗    (7.41)
    Q⊤w̃ = [I − (Λ + αI)⁻¹ α] Q⊤w∗    (7.42)

Comparing Eq. 7.40 and Eq. 7.42, we see that if the hyperparameters ε, α, and τ are chosen such that

    (I − εΛ)^τ = (Λ + αI)⁻¹ α,    (7.43)

then L2 regularization and early stopping can be seen to be equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λ_i are small (that is, ελ_i ≪ 1 and λ_i/α ≪ 1) then

    τ ≈ 1/(εα),    (7.44)
    α ≈ 1/(τε).    (7.45)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L2 regularization parameter, and the inverse of ετ plays the role of the weight decay coefficient.

³ For neural networks, to obtain symmetry breaking between hidden units, we cannot initialize all the parameters to 0, as discussed in Sec. 6.2. However, the argument holds for any other initial value w^(0).
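The logarithm step compressed above can be spelled out as follows (a worked expansion consistent with Eqs. 7.43–7.44, written here in LaTeX):

```latex
% Eq. 7.43 holds elementwise in the eigenbasis:
%   (1 - \epsilon \lambda_i)^\tau = \alpha / (\lambda_i + \alpha).
% Taking logarithms of both sides:
\tau \log(1 - \epsilon \lambda_i) = -\log\left(1 + \lambda_i / \alpha\right)
% Using \log(1 + x) \approx x when \epsilon\lambda_i \ll 1 and \lambda_i/\alpha \ll 1:
-\tau \epsilon \lambda_i \approx -\lambda_i / \alpha
\qquad\Longrightarrow\qquad
\tau \approx \frac{1}{\epsilon \alpha}.
```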
Parameter values corresponding to directions of significant curvature (of the objective function) are regularized less than directions of less curvature. Of course, in the context of early stopping, this really means that parameters that correspond to directions of significant curvature tend to learn early relative to parameters corresponding to directions of less curvature.
The derivations in this section have shown that a trajectory of length τ ends at a point that corresponds to a minimum of the L2-regularized objective. Early stopping is of course more than the mere restriction of the trajectory length; instead, early stopping typically involves monitoring the validation set error in order to stop the trajectory at a particularly good point in space. Early stopping therefore has the advantage over weight decay that early stopping automatically determines the correct amount of regularization while weight decay requires many training experiments with different values of its hyperparameter.
7.9 Parameter Tying and Parameter Sharing

Thus far, in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed region or point. For example, L2 regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero. However, sometimes we may need other ways to express our prior knowledge about suitable values of the model parameters.
Sometimes we might not know precisely what values the parameters should take but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

A common type of dependency that we often want to express is that certain parameters should be close to one another. Consider the following scenario: we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters w^(A) and model B with parameters w^(B). The two models map the input to two different, but related outputs: ŷ^(A) = f(w^(A), x) and ŷ^(B) = g(w^(B), x).
Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: ∀i, w_i^(A) should be close to w_i^(B). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖₂². Here we used an L2 penalty, but other choices are also possible.
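A minimal NumPy sketch of this penalty and its gradients; the weighting coefficient alpha is an assumption added here for illustration (the text above leaves the weighting implicit).

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha):
    """L2 parameter-tying penalty: alpha * ||w_a - w_b||_2^2.

    Added to the task loss, its gradient pulls each model's weights
    toward the other's rather than toward zero (as weight decay would).
    """
    diff = w_a - w_b
    return alpha * np.sum(diff ** 2)

def tying_penalty_grads(w_a, w_b, alpha):
    """Gradients of the penalty with respect to w_a and w_b."""
    diff = w_a - w_b
    return 2 * alpha * diff, -2 * alpha * diff

w_a = np.array([0.5, -1.0, 2.0])
w_b = np.array([0.4, -0.8, 2.5])
print(tying_penalty(w_a, w_b, alpha=0.1))  # 0.1 * (0.01 + 0.04 + 0.25) = 0.03
```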
This kind of approach was proposed by Lasserre et al. (2006), who regularized the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures were constructed such that many of the parameters in the classifier model could be paired to corresponding parameters in the unsupervised model.
While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, where we interpret the various models or model components as sharing a unique set of parameters. A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) need to be stored in memory. In certain models—such as the convolutional neural network—this can lead to significant reduction in the memory footprint of the model.
Convolutional Neural Networks By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.

Natural images have many statistical properties that are invariant to translation. For example, a photo of a cat remains a photo of a cat if it is translated one pixel to the right. CNNs take this property into account by sharing parameters across multiple image locations. The same feature (a hidden unit with the same weights) is computed over different locations in the input. This means that we can find a cat with the same cat detector whether the cat appears at column i or column i + 1 in the image.
Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture.

CNNs will be discussed in more detail in Chapter 9.


7.10 Sparse Representations

Weight decay acts by placing a penalty directly on the model parameters. Another strategy is to place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse. This indirectly imposes a complicated penalty on the model parameters.
how
w L p enalization induces a sparse
parametrization—meaning that man many y of the parameters become zero (or close to
W e ha ve
zero). Represenalready
Representationaldiscussed
tational sparsity (in Sec.
sparsity,, on the 7.1.2other L penalization
) howhand, describ
describeses ainduces a sparse
represen
representation
tation
parametrization—meaning that man
where many of the elements of the represen y of the parameters
representation b ecome zero (or
tation are zero (or close to zero).close to
zero).
A Represen
simplified viewtational
of thissparsity , on the
distinction can other hand, describ
be illustrated in thees con
a represen
text of tation
context linear
where many
regression: of the elements of the represen tation are zero (or close to zero).
A simplified view of this distinction can be illustrated inthe con  text of linear
regression:     2
    ⎡ 18 ⎤   ⎡ 4  0  0 −2  0  0 ⎤ ⎡  2 ⎤
    ⎢  5 ⎥   ⎢ 0  0 −1  0  3  0 ⎥ ⎢  3 ⎥
    ⎢ 15 ⎥ = ⎢ 0  5  0  0  0  0 ⎥ ⎢ −2 ⎥    (7.46)
    ⎢ −9 ⎥   ⎢ 1  0  0 −1  0 −4 ⎥ ⎢ −5 ⎥
    ⎣ −3 ⎦   ⎣ 1  0  0  0 −5  0 ⎦ ⎢  1 ⎥
                                   ⎣  4 ⎦
      y ∈ ℝ^m       A ∈ ℝ^{m×n}       x ∈ ℝ^n

    ⎡ −14 ⎤   ⎡  3 −1  2 −5  4  1 ⎤ ⎡  0 ⎤
    ⎢   1 ⎥   ⎢  4  2 −3 −1  1  3 ⎥ ⎢  2 ⎥
    ⎢  19 ⎥ = ⎢ −1  5  4  2 −3 −2 ⎥ ⎢  0 ⎥    (7.47)
    ⎢   2 ⎥   ⎢  3  1  2 −3  0 −3 ⎥ ⎢  0 ⎥
    ⎣  23 ⎦   ⎣ −5  4 −2  2 −5 −1 ⎦ ⎢ −3 ⎥
                                     ⎣  0 ⎦
      y ∈ ℝ^m       B ∈ ℝ^{m×n}       h ∈ ℝ^n
In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.
Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization.

Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation, denoted Ω(h). As before, we denote the regularized loss function by J̃:
    J̃(θ; X, y) = J(θ; X, y) + αΩ(h),    (7.48)

where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.
Just as an L1 penalty on the parameters induces parameter sparsity, an L1 penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖₁ = ∑_i |h_i|. Of course, the L1 penalty is only one choice of penalty that can result in a sparse representation. Others include the penalty derived from a Student-t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008) that are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) both provide examples of strategies based on regularizing the average activation across several examples, (1/m) ∑_i h^(i), to be near some target value, such as a vector with .01 for each entry.
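As a minimal illustration, the L1 choice of Ω in Eq. 7.48 amounts to the following sketch (task_loss and alpha are hypothetical placeholders):

```python
import numpy as np

def sparse_representation_loss(task_loss, h, alpha):
    """Regularized loss of Eq. 7.48 with the L1 choice of Omega:
    J~ = J + alpha * ||h||_1, penalizing the representation h
    rather than the parameters."""
    return task_loss + alpha * np.sum(np.abs(h))

h = np.array([0.0, 2.0, 0.0, 0.0, -3.0, 0.0])   # the sparse h of Eq. 7.47
print(sparse_representation_loss(task_loss=1.25, h=h, alpha=0.1))  # 1.25 + 0.5
```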
Other approaches obtain representational sparsity with a hard constraint on the activation values. For example, orthogonal matching pursuit (Pati et al., 1993) encodes an input x with the representation h that solves the constrained optimization problem

    arg min_{h, ‖h‖₀<k} ‖x − Wh‖²,    (7.49)

where ‖h‖₀ is the number of non-zero entries of h. This problem can be solved efficiently when W is constrained to be orthogonal. This method is often called OMP-k, with the value of k specified to indicate the number of non-zero features allowed. Coates and Ng (2011) demonstrated that OMP-1 can be a very effective feature extractor for deep architectures.
Essentially any model that has hidden units can be made sparse. Throughout this book, we will see many examples of sparsity regularization used in a variety of contexts.
7.11 Bagging and Other Ensemble Methods

Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994). The idea is to train several different models separately, then have all of the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.

The reason that model averaging works is that different models will usually not make all the same errors on the test set.
not make all the same errors on the test set.
mak
makes es an error i on each example, with the errors drawn from a zero-mean
Consider for example a set of k regression models.
[2i ] = vSupp oseariances
that eacEh[model
multiv
ultivariate
ariate normal distribution with variances E and cov covariances i j ] =
mak
c. P es an error  on
Then the error made by the av each example,
average with the
erage prediction errors drawn from a zero-mean
E of all the ensem ensemble ble momodels
E dels is
m1 ultivariate normal distribution with variances [ ] = v and covariances [  ] =

i i . The exp expected
ected squared error of the ensem ensemble ble predictor is
ck. Then the error made by the average prediction of all the ensemble models is
 
! 2error   
 . The expected squared X of the X ensemble predictor
X is
 1  1   2  
E i = 2E i + i j (7.50)
k k
E 1 i 1 E i j 6=i
 =  +  (7.50)
k 1k k−1
P = v+ c. (7.51)
k k
1 k 1
 errors are  = v +  c. and c = v, the mean squared (7.51)
In the case where the ! perfectly
k k
correlated

error reduces to v, so theX model av averaging
eragingdo Xes not help
does 
X at all. In the case where
In the
the caseare
errors where the errors
perfectly are perfectly
uncorrelated and ccorrelated
= 00,, the exp expectedc =
andected , the mean
vsquared errorsquared
of the
error
ensem
ensemblereduces to v1, so the model
ble is only k v. This means that the exp av eraging do es
expected not help at all. In
ected squared error of the ensemblethe case where
the errorslinearly
decreases are perfectly
with the uncorrelated
ensem
ensemble and cIn=other
ble size. 0, thewords,
expectedon avsquared
erage, the error of the
ensem
ensemble ble
ensem
will ble is only
perform at least v. This means
as well as anythatof theitsexp ected squared
members, and iferror the of the ensemble
members make
decreases
indep
independen
enden
endent linearly
t errors,with the the ensemble
ensemble willsize. In other
p erform words, on
significantly average,
better thanthe ensem
its mem
memb ble
bers.
will perform at least as well as any of its members, and if the members make
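A quick Monte Carlo check of Eq. 7.51 (an illustrative simulation added here, not part of the original text): we draw correlated errors and compare the empirical mean squared ensemble error with (1/k)v + ((k − 1)/k)c.

```python
import numpy as np

k, v, c = 10, 1.0, 0.25
# Covariance matrix with variance v on the diagonal and covariance c elsewhere.
cov = np.full((k, k), c) + np.eye(k) * (v - c)

rng = np.random.default_rng(0)
errors = rng.multivariate_normal(np.zeros(k), cov, size=200000)
ensemble_error = errors.mean(axis=1)            # (1/k) sum_i eps_i

empirical = np.mean(ensemble_error ** 2)
predicted = v / k + (k - 1) / k * c
print(empirical, predicted)                     # both close to 0.325
```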
Different ensemble methods construct the ensemble of models in different ways. For example, each member of the ensemble could be formed by training a completely different kind of model using a different algorithm or objective function. Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.
Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and also contains several duplicate examples (on average around 2/3 of the examples from the original dataset are found in the resulting training set, if it has the same size as the original). Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models. See Fig. 7.5 for an example.


Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an ‘8’ detector on the dataset depicted above, containing an ‘8’, a ‘6’ and a ‘9’. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the ‘9’ and repeats the ‘8’. On this dataset, the detector learns that a loop on top of the digit corresponds to an ‘8’. On the second dataset, we repeat the ‘9’ and omit the ‘6’. In this case, the detector learns that a loop on the bottom of the digit corresponds to an ‘8’. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the ‘8’ are present.
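A minimal sketch of the resampling step and the ensemble vote; the models themselves are hypothetical callables standing in for whatever learner is being bagged.

```python
import numpy as np

def bagging_datasets(n_examples, k, rng):
    """Construct k bootstrap datasets: each is a list of n_examples indices
    drawn with replacement, so each omits some originals and repeats others
    (on average about 2/3 of the originals appear in each)."""
    return [rng.integers(0, n_examples, size=n_examples) for _ in range(k)]

def bagged_predict(models, x):
    """Average the predictions of the ensemble members (model averaging)."""
    return np.mean([model(x) for model in models], axis=0)

rng = np.random.default_rng(0)
indices = bagging_datasets(n_examples=6, k=2, rng=rng)
print(indices)                                   # duplicates and omissions

models = [lambda x: 2 * x, lambda x: 4 * x]      # stand-ins for trained models
print(bagged_predict(models, np.array([1.0])))   # [3.]
```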
Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even if all of the models are trained on the same dataset. Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.

Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because any machine learning algorithm can benefit substantially from model averaging at the price of increased computation and memory. For this reason, benchmark comparisons are usually made using a single model.

Machine learning contests are usually won by methods using model averaging over dozens of models. A recent prominent example is the Netflix Grand Prize (Koren, 2009).

Not all techniques for constructing ensembles are designed to make the ensemble more regularized than the individual models. For example, a technique called boosting (Freund and Schapire, 1996b,a) constructs an ensemble with higher capacity than the individual models. Boosting has been applied to build ensembles of neural networks (Schwenk and Bengio, 1998) by incrementally adding neural networks to the ensemble. Boosting has also been applied to interpret an individual neural network as an ensemble (Bengio et al., 2006a), incrementally adding hidden units to the neural network.
7.12 Dropout

Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models. To a first approximation, dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks. Bagging involves training multiple models, and evaluating multiple models on each test example. This seems impractical when each model is a large neural network, since training and evaluating such networks is costly in terms of runtime and memory. It is common to use ensembles of five to ten neural networks—Szegedy et al. (2014a) used six to win the ILSVRC—but more than this rapidly becomes unwieldy. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.

Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network, as illustrated in Fig. 7.6. In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from a network by multiplying its output value by zero. This procedure requires some slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value. Here, we present the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially modified to work with other operations that remove a unit from the network.
Recall that to learn with bagging, we define k differen differentt mo dels, construct k
models,
network.
differen
differentt datasets by sampling from the training set with replacemen replacement, t, and then
trainRecall
modelthat i onto learn iwith
dataset . Drop bagging,
Dropout out aimswetodefine
appro k differen
approximate
ximate this t mo
pro dels,
cess, construct
process, but with an k
differen
exp
exponen
onen ttially
onentiallydatasets
largebn yumsampling
umb from the
ber of neural netw training
networks.
orks. Sp set with ,replacemen
Specifically
ecifically
ecifically, to train witht, and
drop then
dropout,
out,
train model i on dataset i . Drop out
we use a minibatch-based learning algorithm that makaims to appro ximate
makesthis pro cess, but
es small steps, suc with
such h an
as
exp
sto onen tially
stocchastic gradien large n um b er of neural netw orks.
gradientt descent. Each time we load an example in Sp ecifically , to train
into with
to a minibatc drop
minibatch, out,
h, we
we use a minibatch-based learning algorithm that makes small steps, such as
stochastic gradient descent. Each time257 we load an example into a minibatch, we
[Figure 7.6 diagram: a panel labeled "Base network" with units x1, x2, h1, h2,
and y, and a panel labeled "Ensemble of Sub-Networks" showing the sixteen
sub-networks formed by dropping units; see the caption below.]

Figure 7.6: Dropout trains an ensemble consisting of all sub-networks that can be
constructed by removing non-output units from an underlying base network. Here, we
begin with a base network with two visible units and two hidden units. There are sixteen
possible subsets of these four units. We show all sixteen subnetworks that may be formed
by dropping out different subsets of units from the original network. In this small example,
a large proportion of the resulting networks have no input units or no path connecting
the input to the output. This problem becomes insignificant for networks with wider
layers, where the probability of dropping all possible paths from inputs to outputs becomes
smaller.

[Figure 7.7 diagram: (Top) the base network with units x1, x2, h1, h2, and y.
(Bottom) the same network with each input and hidden unit multiplied by its mask
entry, giving x̂1 = µx1 x1, x̂2 = µx2 x2, ĥ1 = µh1 h1, and ĥ2 = µh2 h2.]

Figure 7.7: An example of forward propagation through a feedforward network using
dropout. (Top) In this example, we use a feedforward network with two input units, one
hidden layer with two hidden units, and one output unit. (Bottom) To perform forward
propagation with dropout, we randomly sample a vector µ with one entry for each input
or hidden unit in the network. The entries of µ are binary and are sampled independently
from each other. The probability of each entry being 1 is a hyperparameter, usually 0.5
for the hidden layers and 0.8 for the input. Each unit in the network is multiplied by
the corresponding mask, and then forward propagation continues through the rest of the
network as usual. This is equivalent to randomly selecting one of the sub-networks from
Fig. 7.6 and running forward propagation through it.
Each time we load an example into a minibatch, we randomly sample a different
binary mask to apply to all of the input and hidden units in the network. The
mask for each unit is sampled independently from all of the others. The probability
of sampling a mask value of one (causing a unit to be included) is a hyperparameter
fixed before training begins. It is not a function of the current value of the model
parameters or the input example. Typically, an input unit is included with
probability 0.8 and a hidden unit is included with probability 0.5. We then run
forward propagation, back-propagation, and the learning update as usual. Fig. 7.7
illustrates how to run forward propagation with dropout.
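
As a concrete illustration, here is a minimal sketch (ours, not from the original
dropout paper) of forward propagation with sampled binary masks in NumPy, for a
network shaped like the one in Fig. 7.7; the weights and the tanh nonlinearity are
invented for the example, while the inclusion probabilities of 0.8 and 0.5 follow
the description above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy base network: two inputs, one hidden layer of two units, one output.
    W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
    w2 = rng.normal(size=2);      b2 = 0.0

    def dropout_forward(x, p_input=0.8, p_hidden=0.5):
        mu_x = rng.random(2) < p_input    # one binary mask entry per input unit
        h = np.tanh(W1 @ (x * mu_x) + b1)
        mu_h = rng.random(2) < p_hidden   # one binary mask entry per hidden unit
        return w2 @ (h * mu_h) + b2       # output units are never dropped

    y = dropout_forward(np.array([1.0, -2.0]))

Each call samples a fresh mask, which is equivalent to selecting one of the
sub-networks of Fig. 7.6 at random.
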
More formally, suppose that a mask vector µ specifies which units to include,
and J(θ, µ) defines the cost of the model defined by parameters θ and mask µ.
Then dropout training consists in minimizing E_µ J(θ, µ). The expectation contains
exponentially many terms, but we can obtain an unbiased estimate of its gradient
by sampling values of µ.

Dropout training is not quite the same as bagging training. In the case of
bagging, the models are all independent. In the case of dropout, the models share
parameters, with each model inheriting a different subset of parameters from the
parent neural network. This parameter sharing makes it possible to represent an
exponential number of models with a tractable amount of memory. In the case of
bagging, each model is trained to convergence on its respective training set. In the
case of dropout, typically most models are not explicitly trained at all; usually,
the model is large enough that it would be infeasible to sample all possible sub-
networks within the lifetime of the universe. Instead, a tiny fraction of the possible
sub-networks are each trained for a single step, and the parameter sharing causes
the remaining sub-networks to arrive at good settings of the parameters. These
are the only differences. Beyond these, dropout follows the bagging algorithm. For
example, the training set encountered by each sub-network is indeed a subset of
the original training set sampled with replacement.
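
The contrast with bagging can be made concrete with a toy linear model (our
sketch; the data, mask probability, and learning rate are invented for illustration):
every gradient step samples a new mask, so each step trains a different sub-model,
yet all sub-models read and write the same shared parameter vector.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 5))
    y = X @ np.arange(5.0) + 0.1 * rng.normal(size=64)

    w = np.zeros(5)                # parameters shared by all 2^5 sub-models
    for step in range(2000):
        i = rng.integers(len(X))
        mu = rng.random(5) < 0.5   # sampling a mask selects one sub-model
        x = X[i] * mu
        w -= 0.01 * (x @ w - y[i]) * x   # one SGD step for that sub-model

Each individual sub-model is visited for only a handful of steps, but parameter
sharing propagates its progress to all of the others.
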
To make a prediction, a bagged ensemble must accumulate votes from all of
its members. We refer to this process as inference in this context. So far, our
description of bagging and dropout has not required that the model be explicitly
probabilistic. Now, we assume that the model's role is to output a probability
distribution. In the case of bagging, each model i produces a probability distribution
p^{(i)}(y | x). The prediction of the ensemble is given by the arithmetic mean of all
of these distributions,

    \frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y \mid x).    (7.52)

In the case of dropout, each sub-model defined by mask vector µ defines a prob-
ability distribution p(y | x, µ). The arithmetic mean over all masks is given by

    \sum_{\mu} p(\mu) \, p(y \mid x, \mu)    (7.53)

where p(µ) is the probability distribution that was used to sample µ at training
time.

Because this sum includes an exponential number of terms, it is intractable
to evaluate except in cases where the structure of the model permits some form
of simplification. So far, deep neural nets are not known to permit any tractable
simplification. Instead, we can approximate the inference with sampling, by
averaging together the output from many masks. Even 10-20 masks are often
sufficient to obtain good performance.

However, there is an even better approach, that allows us to obtain a good
approximation to the predictions of the entire ensemble, at the cost of only one
forward propagation. To do so, we change to using the geometric mean rather than
the arithmetic mean of the ensemble members' predicted distributions. Warde-
Farley et al. (2014) present arguments and empirical evidence that the geometric
mean performs comparably to the arithmetic mean in this context.

The geometric mean of multiple probability distributions is not guaranteed to be
a probability distribution. To guarantee that the result is a probability distribution,
we impose the requirement that none of the sub-models assigns probability 0 to any
event, and we renormalize the resulting distribution. The unnormalized probability
distribution defined directly by the geometric mean is given by

    \tilde{p}_{\text{ensemble}}(y \mid x) = \sqrt[2^d]{\prod_{\mu} p(y \mid x, \mu)}    (7.54)

where d is the number of units that may be dropped. Here we use a uniform
distribution over µ to simplify the presentation, but non-uniform distributions are
also possible. To make predictions we must re-normalize the ensemble:

    p_{\text{ensemble}}(y \mid x) = \frac{\tilde{p}_{\text{ensemble}}(y \mid x)}{\sum_{y'} \tilde{p}_{\text{ensemble}}(y' \mid x)}.    (7.55)
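
The two ensemble predictions just described can be compared directly in code.
The sketch below (ours; the model and the number of masks are toy choices)
approximates the arithmetic mean of Eq. 7.52 by sampling masks, and computes
the renormalized geometric mean of Eqs. 7.54-7.55 in log space for numerical
stability.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3)); b = np.zeros(3)   # toy softmax model, 3 classes

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def predict(x, mu):
        return softmax((x * mu) @ W + b)   # sub-model selected by mask mu

    x = rng.normal(size=4)
    masks = rng.random((20, 4)) < 0.5      # 10-20 masks usually suffice

    arith = np.mean([predict(x, mu) for mu in masks], axis=0)   # Eq. 7.52 style

    logs = np.array([np.log(predict(x, mu)) for mu in masks])
    geo = np.exp(logs.mean(axis=0))
    geo /= geo.sum()                       # renormalize, as in Eq. 7.55

Both arith and geo are proper distributions over the three classes.
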
A key insight (Hinton et al., 2012c) involved in dropout is that we can approxi-
mate p_ensemble by evaluating p(y | x) in one model: the model with all units, but
with the weights going out of unit i multiplied by the probability of including unit
i. The motivation for this modification is to capture the right expected value of
the output from that unit. We call this approach the weight scaling inference rule.
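
In code, the rule is a one-line transformation of the trained parameters (a sketch
under the toy setup used earlier; W1 carries the weights out of the input units,
which dropout includes with probability 0.8, and w2 carries the weights out of the
hidden units, included with probability 0.5):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(2, 2)); w2 = rng.normal(size=2)  # trained with dropout

    # Multiply the weights going out of each unit by that unit's
    # inclusion probability, then run the network once with no masks.
    W1_test = W1 * 0.8
    w2_test = w2 * 0.5
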
There is not yet any theoretical argument for the accuracy of this approximate
inference rule in deep nonlinear networks, but empirically it performs very well.

Because we usually use an inclusion probability of 1/2, the weight scaling rule
usually amounts to dividing the weights by 2 at the end of training, and then using
the model as usual. Another way to achieve the same result is to multiply the
states of the units by 2 during training. Either way, the goal is to make sure that
the expected total input to a unit at test time is roughly the same as the expected
total input to that unit at train time, even though half the units at train time are
missing on average.

For many classes of models that do not have nonlinear hidden units, the weight
scaling inference rule is exact. For a simple example, consider a softmax regression
classifier with n input variables represented by the vector v:

    P(\mathrm{y} = y \mid v) = \operatorname{softmax}\left(W^\top v + b\right)_y.    (7.56)

We can index into the family of sub-models by element-wise multiplication of the
input with a binary vector d:

    P(\mathrm{y} = y \mid v; d) = \operatorname{softmax}\left(W^\top (d \odot v) + b\right)_y.    (7.57)

The ensemble predictor is defined by re-normalizing the geometric mean over all
ensemble members' predictions:

    P_{\text{ensemble}}(\mathrm{y} = y \mid v) = \frac{\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v)}{\sum_{y'} \tilde{P}_{\text{ensemble}}(\mathrm{y} = y' \mid v)}    (7.58)

where

    \tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v) = \sqrt[2^n]{\prod_{d \in \{0,1\}^n} P(\mathrm{y} = y \mid v; d)}.    (7.59)

To see that the weight scaling rule is exact, we can simplify \tilde{P}_{\text{ensemble}}:

    \tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v) = \sqrt[2^n]{\prod_{d \in \{0,1\}^n} P(\mathrm{y} = y \mid v; d)}    (7.60)

    = \sqrt[2^n]{\prod_{d \in \{0,1\}^n} \operatorname{softmax}\left(W^\top (d \odot v) + b\right)_y}    (7.61)

    = \sqrt[2^n]{\prod_{d \in \{0,1\}^n} \frac{\exp\left(W_{y,:}^\top (d \odot v) + b_y\right)}{\sum_{y'} \exp\left(W_{y',:}^\top (d \odot v) + b_{y'}\right)}}    (7.62)

    = \frac{\sqrt[2^n]{\prod_{d \in \{0,1\}^n} \exp\left(W_{y,:}^\top (d \odot v) + b_y\right)}}{\sqrt[2^n]{\prod_{d \in \{0,1\}^n} \sum_{y'} \exp\left(W_{y',:}^\top (d \odot v) + b_{y'}\right)}}    (7.63)

Because \tilde{P} will be normalized, we can safely ignore multiplication by factors that
are constant with respect to y:

    \tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v) \propto \sqrt[2^n]{\prod_{d \in \{0,1\}^n} \exp\left(W_{y,:}^\top (d \odot v) + b_y\right)}    (7.64)

    = \exp\left(\frac{1}{2^n} \sum_{d \in \{0,1\}^n} W_{y,:}^\top (d \odot v) + b_y\right)    (7.65)

    = \exp\left(\frac{1}{2} W_{y,:}^\top v + b_y\right)    (7.66)

Substituting this back into Eq. 7.58, we obtain a softmax classifier with weights
\frac{1}{2}W.
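
This derivation is easy to verify numerically. The following script (our check,
not from the text) enumerates all 2^n sub-models of a tiny softmax regression
classifier and confirms that the renormalized geometric mean coincides with the
weight-scaled classifier:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    W = rng.normal(size=(n, 3)); b = rng.normal(size=3)
    v = rng.normal(size=n)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Renormalized geometric mean over all 2^n masks (Eqs. 7.58-7.59),
    # computed in log space.
    logs = [np.log(softmax((np.array(d) * v) @ W + b))
            for d in itertools.product([0, 1], repeat=n)]
    geo = np.exp(np.mean(logs, axis=0))
    geo /= geo.sum()

    # Weight scaling inference rule: halve the weights (equivalently, halve v).
    scaled = softmax((v / 2) @ W + b)

    print(np.allclose(geo, scaled))   # True: the rule is exact here
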
The weight scaling rule is also exact in other settings, including regression
networks with conditionally normal outputs, and deep networks that have hidden
layers without nonlinearities. However, the weight scaling rule is only an approxi-
mation for deep models that have nonlinearities. Though the approximation has
not been theoretically characterized, it often works well empirically. Goodfellow
et al. (2013a) found experimentally that the weight scaling approximation can work
better (in terms of classification accuracy) than Monte Carlo approximations to the
ensemble predictor. This held true even when the Monte Carlo approximation was
allowed to sample up to 1,000 sub-networks. Gal and Ghahramani (2015) found
that some models obtain better classification accuracy using twenty samples and
the Monte Carlo approximation. It appears that the optimal choice of inference
approximation is problem-dependent.

Srivastava et al. (2014) showed that dropout is more effective than other
standard computationally inexpensive regularizers, such as weight decay, filter
norm constraints and sparse activity regularization. Dropout may also be combined
with other forms of regularization to yield a further improvement.

One advantage of dropout is that it is very computationally cheap. Using
dropout during training requires only O(n) computation per example per update,
to generate n random binary numbers and multiply them by the state. Depending
on the implementation, it may also require O(n) memory to store these binary
numbers until the back-propagation stage. Running inference in the trained model
has the same cost per-example as if dropout were not used, though we must pay
the cost of dividing the weights by 2 once before beginning to run inference on
examples.

Another significant advantage of dropout is that it does not significantly limit
the type of model or training procedure that can be used. It works well with nearly
any model that uses a distributed representation and can be trained with stochastic
gradient descent. This includes feedforward neural networks, probabilistic models
such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent
neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). Many other
regularization strategies of comparable power impose more severe restrictions on
the architecture of the model.

Though the cost per-step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. Because dropout
is a regularization technique, it reduces the effective capacity of a model. To offset
this effect, we must increase the size of the model. Typically the optimal validation
set error is much lower when using dropout, but this comes at the cost of a much
larger model and many more iterations of the training algorithm. For very large
datasets, regularization confers little reduction in generalization error. In these
cases, the computational cost of using dropout and larger models may outweigh
the benefit of regularization.

When extremely few labeled training examples are available, dropout is less
effective. Bayesian neural networks (Neal, 1996) outperform dropout on the
Alternative Splicing Dataset (Xiong et al., 2011), where fewer than 5,000 examples
are available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.

Wager et al. (2013) showed that, when applied to linear regression, dropout
is equivalent to L^2 weight decay, with a different weight decay coefficient for
each input feature. The magnitude of each feature's weight decay coefficient is
determined by its variance. Similar results hold for other linear models. For deep
models, dropout is not equivalent to weight decay.
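
The flavor of this result can be seen with a short calculation (our sketch, in the
spirit of Wager et al. (2013) rather than their exact presentation). Apply i.i.d.
Bernoulli(p) masks m to the inputs of a linear model, rescaling by 1/p so that
E[m_i / p] = 1, and take the expectation of the squared error over the masks:

    \mathbb{E}_{m}\left[\left(y - w^\top (m \odot x)/p\right)^2\right]
    = \left(y - w^\top x\right)^2 + \frac{1-p}{p} \sum_i w_i^2 x_i^2.

Summed over the training set, the second term is an L^2 penalty whose coefficient
for w_i scales with the second moment of feature i, which for centered features is
its variance.
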
The stochasticity used while training with dropout is not necessary for the
approach's success. It is just a means of approximating the sum over all sub-
models. Wang and Manning (2013) derived analytical approximations to this
marginalization. Their approximation, known as fast dropout, resulted in faster
convergence time due to the reduced stochasticity in the computation of the
gradient. This method can also be applied at test time, as a more principled
(but also more computationally expensive) approximation to the average over all
sub-networks than the weight scaling approximation. Fast dropout has been used
to nearly match the performance of standard dropout on small neural network
problems, but has not yet yielded a significant improvement or been applied to a
large problem.

Just as stochasticity is not necessary to achieve the regularizing effect of
dropout, it is also not sufficient. To demonstrate this, Warde-Farley et al. (2014)
designed control experiments using a method called dropout boosting that they
designed to use exactly the same mask noise as traditional dropout but lack
its regularizing effect. Dropout boosting trains the entire ensemble to jointly
maximize the log-likelihood on the training set. In the same sense that traditional
dropout is analogous to bagging, this approach is analogous to boosting. As
intended, experiments with dropout boosting show almost no regularization effect
compared to training the entire network as a single model. This demonstrates that
the interpretation of dropout as bagging has value beyond the interpretation of
dropout as robustness to noise. The regularization effect of the bagged ensemble is
only achieved when the stochastically sampled ensemble members are trained to
perform well independently of each other.

Dropout has inspired other stochastic approaches to training exponentially
large ensembles of models that share weights. DropConnect is a special case of
dropout where each product between a single scalar weight and a single hidden
unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic
pooling is a form of randomized pooling (see Sec. 9.3) for building ensembles
of convolutional networks with each convolutional network attending to different
spatial locations of each feature map. So far, dropout remains the most widely
used implicit ensemble method.

One of the key insights of dropout is that training a network with stochastic
behavior and making predictions by averaging over multiple stochastic decisions
implements a form of bagging with parameter sharing. Earlier, we described
dropout as bagging an ensemble of models formed by including or excluding
units. However, there is no need for this model averaging strategy to be based on
inclusion and exclusion. In principle, any kind of random modification is admissible.
In practice, we must choose modification families that neural networks are able
to learn to resist. Ideally, we should also use model families that allow a fast
approximate inference rule. We can think of any form of modification parametrized
by a vector µ as training an ensemble consisting of p(y | x, µ) for all possible
values of µ. There is no requirement that µ have a finite number of values. For
example, µ can be real-valued. Srivastava et al. (2014) showed that multiplying the
weights by µ ∼ N(1, I) can outperform dropout based on binary masks. Because
E[µ] = 1, the standard network automatically implements approximate inference
in the ensemble, without needing any weight scaling.

So far we have described dropout purely as a means of performing efficient,
approximate bagging. However, there is another view of dropout that goes further
than this. Dropout trains not just a bagged ensemble of models, but an ensemble
of models that share hidden units. This means each hidden unit must be able to
perform well regardless of which other hidden units are in the model. Hidden units
must be prepared to be swapped and interchanged between models. Hinton et al.
(2012c) were inspired by an idea from biology: sexual reproduction, which involves
swapping genes between two different organisms, creates evolutionary pressure for
genes to become not just good, but to become readily swapped between different
organisms. Such genes and such features are very robust to changes in their
environment because they are not able to incorrectly adapt to unusual features
of any one organism or model. Dropout thus regularizes each hidden unit to be
not merely a good feature but a feature that is good in many contexts. Warde-
Farley et al. (2014) compared dropout training to training of large ensembles and
concluded that dropout offers additional improvements to generalization error
beyond those obtained by ensembles of independent models.

It is important to understand that a large portion of the power of dropout
arises from the fact that the masking noise is applied to the hidden units. This
can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For
example, if the model learns a hidden unit h_i that detects a face by finding the nose,
then dropping h_i corresponds to erasing the information that there is a nose in
the image. The model must learn another h_i, either one that redundantly encodes
the presence of a nose, or one that detects the face by another feature, such as the
mouth. Traditional noise injection techniques that add unstructured noise at the
input are not able to randomly erase the information about a nose from an image
of a face unless the magnitude of the noise is so great that nearly all of the
information in the image is removed. Destroying extracted features rather than
original values allows the destruction process to make use of all of the knowledge
about the input distribution that the model has acquired so far.

Another important aspect of dropout is that the noise is multiplicative. If the
noise were additive with fixed scale, then a rectified linear hidden unit h_i with
added noise ε could simply learn to have h_i become very large in order to make
the added noise ε insignificant by comparison. Multiplicative noise does not allow
such a pathological solution to the noise robustness problem.
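
Both observations, the real-valued masks µ ∼ N(1, I) mentioned above and the
advantage of multiplicative over additive noise, fit in a few lines of NumPy (our
sketch; the layer shapes and noise scale are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 5)); b = np.zeros(3)

    def noisy_forward(x):
        mu = rng.normal(loc=1.0, scale=1.0, size=x.shape)  # mu ~ N(1, I)
        return np.maximum(0.0, W @ (x * mu) + b)           # multiplicative noise

    def test_forward(x):
        return np.maximum(0.0, W @ x + b)   # E[mu] = 1: no rescaling needed

Because the noise multiplies the unit's value, the perturbation h(µ - 1) grows in
proportion to h itself, so a unit cannot escape the noise by growing large.
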
[Figure 7.8 diagram: an image x, classified as "panda" with 57.7% confidence,
plus .007 × sign(∇_x J(θ, x, y)), itself classified as "nematode" with 8.2%
confidence, yields x + ε sign(∇_x J(θ, x, y)), classified as "gibbon" with 99.3%
confidence.]

Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet
(Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose
elements are equal to the sign of the elements of the gradient of the cost function with
respect to the input, we can change GoogLeNet's classification of the image. Reproduced
with permission from Goodfellow et al. (2014b).

Another deep learning algorithm, batch normalization, reparametrizes the
model in a way that introduces both additive and multiplicative noise on the
hidden units at training time. The primary purpose of batch normalization is to
improve optimization, but the noise can have a regularizing effect, and sometimes
makes dropout unnecessary. Batch normalization is described further in Sec. 8.7.1.

7.13 Adversarial Training

In many cases, neural networks have begun to reach human performance when
evaluated on an i.i.d. test set. It is natural therefore to wonder whether these
models have obtained a true human-level understanding of these tasks. In order
to probe the level of understanding a network has of the underlying task, we can
search for examples that the model misclassifies. Szegedy et al. (2014b) found that
even neural networks that perform at human level accuracy have a nearly 100%
error rate on examples that are intentionally constructed by using an optimization
procedure to search for an input x′ near a data point x such that the model output
is very different at x′. In many cases, x′ can be so similar to x that a human
observer cannot tell the difference between the original example and the adversarial
example, but the network can make highly different predictions. See Fig. 7.8 for
an example.

Adversarial examples have many implications, for example, in computer security,
that are beyond the scope of this chapter. However, they are interesting in the
context of regularization because one can reduce the error rate on the original i.i.d.
test set via adversarial training: training on adversarially perturbed examples
from the training set (Szegedy et al., 2014b; Goodfellow et al., 2014b).

Goodfellow et al. (2014b) showed that one of the primary causes of these
adversarial examples is excessive linearity. Neural networks are built out of
primarily linear building blocks. In some experiments the overall function they
implement proves to be highly linear as a result. These linear functions are easy
to optimize. Unfortunately, the value of a linear function can change very rapidly
if it has numerous inputs. If we change each input by ε, then a linear function
with weights w can change by as much as ε‖w‖_1, which can be a very large
amount if w is high-dimensional. Adversarial training discourages this highly
sensitive locally linear behavior by encouraging the network to be locally constant
in the neighborhood of the training data. This can be seen as a way of explicitly
introducing a local constancy prior into supervised neural nets.

Adversarial training helps to illustrate the power of using a large function
family in combination with aggressive regularization. Purely linear models, like
logistic regression, are not able to resist adversarial examples because they are
forced to be linear. Neural networks are able to represent functions that can range
from nearly linear to nearly locally constant and thus have the flexibility to capture
linear trends in the training data while still learning to resist local perturbation.

Adversarial examples also provide a means of accomplishing semi-supervised
learning. At a point x that is not associated with a label in the dataset, the
model itself assigns some label ŷ. The model's label ŷ may not be the true label,
but if the model is high quality, then ŷ has a high probability of providing the
true label. We can seek an adversarial example x′ that causes the classifier to
output a label y′ with y′ ≠ ŷ. Adversarial examples generated using not the
true label but a label provided by a trained model are called virtual adversarial
examples (Miyato et al., 2015). The classifier may then be trained to assign the
same label to x and x′. This encourages the classifier to learn a function that is
robust to small changes anywhere along the manifold where the unlabeled data
lies. The assumption motivating this approach is that different classes usually lie
on disconnected manifolds, and a small perturbation should not be able to jump
from one class manifold to another class manifold.
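
The fast construction suggested by the linearity view (and used in Fig. 7.8)
perturbs the input in the direction of the sign of the input gradient. A sketch
(ours), assuming nothing more than a differentiable loss, here a toy logistic loss
with a hand-computed gradient:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    w = rng.normal(size=10); b = 0.0
    x = rng.normal(size=10); y = 1.0

    # Gradient of the logistic loss J(theta, x, y) with respect to the input x.
    grad_x = (sigmoid(w @ x + b) - y) * w

    epsilon = 0.007
    x_adv = x + epsilon * np.sign(grad_x)   # adversarial example, as in Fig. 7.8

Adversarial training then mixes such perturbed examples, labeled with the clean
example's label, into the training set.
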
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Many machine learning algorithms aim to overcome the curse of dimensionality
by assuming that the data lies near a low-dimensional manifold, as described in
Sec. 5.11.3.

One of the early attempts to take advantage of the manifold hypothesis is the
tangent distance algorithm (Simard et al., 1993, 1998). It is a non-parametric
nearest-neighbor algorithm in which the metric used is not the generic Euclidean
distance but one that is derived from knowledge of the manifolds near which
probability concentrates. It is assumed that we are trying to classify examples and
that examples on the same manifold share the same category. Since the classifier
should be invariant to the local factors of variation that correspond to movement
on the manifold, it would make sense to use as nearest-neighbor distance between
points x_1 and x_2 the distance between the manifolds M_1 and M_2 to which they
respectively belong. Although that may be computationally difficult (it would
require solving an optimization problem to find the nearest pair of points on M_1
and M_2), a cheap alternative that makes sense locally is to approximate M_i by its
tangent plane at x_i and measure the distance between the two tangents, or between
a tangent plane and a point. That can be achieved by solving a low-dimensional
linear system (in the dimension of the manifolds). Of course, this algorithm requires
one to specify the tangent vectors.

In a related spirit, the tangent prop algorithm (Simard et al., 1992) (Fig. 7.9)
trains a neural net classifier with an extra penalty to make each output f(x) of
the neural net locally invariant to known factors of variation. These factors of
variation correspond to movement along the manifold near which examples of the
same class concentrate. Local invariance is achieved by requiring ∇_x f(x) to be
orthogonal to the known manifold tangent vectors v^{(i)} at x, or equivalently that
the directional derivative of f at x in the directions v^{(i)} be small, by adding a
regularization penalty Ω:

    \Omega(f) = \sum_i \left( \left( \nabla_x f(x) \right)^\top v^{(i)} \right)^2.    (7.67)
This regularizer can of course be scaled by an appropriate hyperparameter, and, for most neural networks, we would need to sum over many outputs rather than the lone output f(x) described here for simplicity. As with the tangent distance algorithm, the tangent vectors are derived a priori, usually from the formal knowledge of the effect of transformations such as translation, rotation, and scaling in images. Tangent prop has been used not just for supervised learning (Simard et al., 1992) but also in the context of reinforcement learning (Thrun, 1995).
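To make Eq. 7.67 concrete, the following is a minimal NumPy sketch of the penalty at a single point for a scalar-output f. It is illustrative only: the function, the point, and the tangent vectors are stand-ins, and the gradient is approximated by central finite differences rather than computed by backpropagation as it would be in a real network.

```python
import numpy as np

def tangent_prop_penalty(f, x, tangents, eps=1e-5):
    """Eq. 7.67 at a single point for a scalar-output f. The gradient of f
    is approximated with central finite differences; in a real network it
    would come from backpropagation."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        grad[j] = (f(x + e) - f(x - e)) / (2 * eps)
    # Sum of squared directional derivatives along each tangent vector.
    return sum(float(grad @ v) ** 2 for v in tangents)

# f is invariant along the direction (1, 1), so that tangent contributes
# ~0 penalty, while a direction f varies along is penalized.
f = lambda x: x[0] - x[1]
x = np.array([0.5, -0.2])
print(tangent_prop_penalty(f, x, [np.array([1.0, 1.0])]))   # ~0.0
print(tangent_prop_penalty(f, x, [np.array([1.0, -1.0])]))  # ~4.0
```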
Tangent propagation is closely related to dataset augmentation. In both cases, the user of the algorithm encodes his or her prior knowledge of the task by specifying a set of transformations that should not alter the output of the
network. The difference is that in the case of dataset augmentation, the network is explicitly trained to correctly classify distinct inputs that were created by applying more than an infinitesimal amount of these transformations. Tangent propagation does not require explicitly visiting a new input point. Instead, it analytically regularizes the model to resist perturbation in the directions corresponding to the specified transformation. While this analytical approach is intellectually elegant, it has two major drawbacks. First, it only regularizes the model to resist infinitesimal perturbation. Explicit dataset augmentation confers resistance to larger perturbations. Second, the infinitesimal approach poses difficulties for models based on rectified linear units. These models can only shrink their derivatives by turning units off or shrinking their weights. They are not able to shrink their derivatives by saturating at a high value with large weights, as sigmoid or tanh units can. Dataset augmentation works well with rectified linear units because different subsets of rectified units can activate for different transformed versions of each original input.

Tangent propagation is also related to double backprop (Drucker and LeCun, 1992) and adversarial training (Szegedy et al., 2014b; Goodfellow et al., 2014b). Double backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original inputs and trains the model to produce the same output on these as on the original inputs. Tangent propagation and dataset augmentation using manually specified transformations both require that the model should be invariant to certain specified directions of change in the input. Double backprop and adversarial training both require that the model should be invariant to all directions of change in the input so long as the change is small. Just as dataset augmentation is the non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version of double backprop.

The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to know the tangent vectors a priori. As we will see in Chapter 14, autoencoders can estimate the manifold tangent vectors. The manifold tangent classifier makes use of this technique to avoid needing user-specified tangent vectors. As illustrated in Fig. 14.10, these estimated tangent vectors go beyond the classical invariants that arise out of the geometry of images (such as translation, rotation and scaling) and include factors that must be learned because they are object-specific (such as moving body parts). The algorithm proposed with the manifold tangent classifier is therefore simple: (1) use an autoencoder to learn the manifold structure by unsupervised learning, and (2) use these tangents to regularize a neural net classifier as in tangent prop (Eq. 7.67).
[Figure 7.9 shows two one-dimensional class manifolds in a two-dimensional space with axes x1 and x2; at one point on a curve, a tangent vector and a normal vector are drawn.]

Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output function f(x). Each curve represents the manifold for a different class, illustrated here as a one-dimensional manifold embedded in a two-dimensional space. On one curve, we have chosen a single point and drawn a vector that is tangent to the class manifold (parallel to and touching the manifold) and a vector that is normal to the class manifold (orthogonal to the manifold). In multiple dimensions there may be many tangent directions and many normal directions. We expect the classification function to change rapidly as it moves in the direction normal to the manifold, and not to change as it moves along the class manifold. Both tangent propagation and the manifold tangent classifier regularize f(x) to not change very much as x moves along the manifold. Tangent propagation requires the user to manually specify functions that compute the tangent directions (such as specifying that small translations of images remain in the same class manifold) while the manifold tangent classifier estimates the manifold tangent directions by training an autoencoder to fit the training data. The use of autoencoders to estimate manifolds will be described in Chapter 14.
This chapter has described most of the general strategies used to regularize neural networks. Regularization is a central theme of machine learning and as such will be revisited periodically by most of the remaining chapters. Another central theme of machine learning is optimization, described next.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θ_0 be the initial parameters.
θ ← θ_0
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
        j ← 0
        θ* ← θ
        i* ← i
        v ← v′
    else
        j ← j + 1
    end if
end while
Best parameters are θ*, best number of training steps is i*
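As a rough illustration, Algorithm 7.1 can be transcribed almost line for line into Python. The train_for_n_steps and validation_error callables are assumed to be supplied by the caller; they are placeholders, not library functions.

```python
import copy

def early_stopping(theta, train_for_n_steps, validation_error, n=100, patience=5):
    """Sketch of Algorithm 7.1. `train_for_n_steps(theta, n)` runs the
    training algorithm for n steps and returns updated parameters;
    `validation_error(theta)` evaluates error on the validation set.
    Both are assumed callables supplied by the caller."""
    best_theta = copy.deepcopy(theta)
    best_i, best_v = 0, float("inf")
    i, j = 0, 0
    while j < patience:
        theta = train_for_n_steps(theta, n)
        i += n
        v = validation_error(theta)
        if v < best_v:     # validation improved: record it and reset patience
            j, best_v = 0, v
            best_theta, best_i = copy.deepcopy(theta), i
        else:              # validation worsened: consume one unit of patience
            j += 1
    return best_theta, best_i
```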
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (Algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i* steps.
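Continuing the sketch above, Algorithm 7.2 reuses the early_stopping function to obtain i* and then retrains from scratch on all the data. The 80/20 split and the helper signatures are illustrative assumptions, not fixed choices.

```python
def early_stop_then_retrain(X, y, init_theta, train_for_n_steps,
                            validation_error, n=100, patience=5):
    """Sketch of Algorithm 7.2, reusing early_stopping from the previous
    sketch. `train_for_n_steps(theta, X, y, n)` and
    `validation_error(theta, X, y)` are assumed helpers."""
    cut = int(0.8 * len(X))                  # illustrative subtrain/valid split
    X_sub, X_valid = X[:cut], X[cut:]
    y_sub, y_valid = y[:cut], y[cut:]

    # Step 1: early stopping on the subtrain/validation split returns i*.
    _, best_i = early_stopping(
        init_theta(),
        lambda th, k: train_for_n_steps(th, X_sub, y_sub, k),
        lambda th: validation_error(th, X_valid, y_valid),
        n=n, patience=patience)

    # Step 2: reinitialize and train on all the data for i* steps.
    theta = init_theta()
    return train_for_n_steps(theta, X, y, best_i)
```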
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continue training until that value is reached.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (Algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
end while
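Algorithm 7.3 admits a similar sketch, again reusing early_stopping from above. Here, instead of retraining for a fixed number of steps, training continues on all the data until the validation loss reaches the subtrain objective value at which early stopping fired. All helpers are assumed, and the loop is not guaranteed to terminate if that value is never reached.

```python
def early_stop_then_continue(X, y, init_theta, train_for_n_steps,
                             loss, n=100, patience=5):
    """Sketch of Algorithm 7.3, reusing early_stopping from above.
    `train_for_n_steps(theta, X, y, n)` and `loss(theta, X, y)` are
    assumed helpers."""
    cut = int(0.8 * len(X))                  # illustrative subtrain/valid split
    X_sub, X_valid = X[:cut], X[cut:]
    y_sub, y_valid = y[:cut], y[cut:]

    theta, _ = early_stopping(
        init_theta(),
        lambda th, k: train_for_n_steps(th, X_sub, y_sub, k),
        lambda th: loss(th, X_valid, y_valid),
        n=n, patience=patience)

    eps = loss(theta, X_sub, y_sub)          # objective value where overfitting began
    while loss(theta, X_valid, y_valid) > eps:   # may never terminate
        theta = train_for_n_steps(theta, X, y, n)
    return theta
```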
Chapter 8

Optimization for Training Deep Models
Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA involves solving an optimization problem. We often use analytical optimization to write proofs or design algorithms. Of all of the many optimization problems involved in deep learning, the most difficult is neural network training. It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem. Because this problem is so important and so expensive, a specialized set of optimization techniques have been developed for solving it. This chapter presents these optimization techniques for neural network training.

If you are unfamiliar with the basic principles of gradient-based optimization, we suggest reviewing Chapter 4. That chapter includes a brief overview of numerical optimization in general.

This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

We begin with a description of how optimization used as a training algorithm for a machine learning task differs from pure optimization. Next, we present several of the concrete challenges that make optimization of neural networks difficult. We then define several practical algorithms, including both optimization algorithms themselves and strategies for initializing the parameters. More advanced algorithms adapt their learning rates during training or leverage information contained in
the second derivatives of the cost function. Finally, we conclude with a review of several optimization strategies that are formed by combining simple optimization algorithms into higher-level procedures.
8.1 How Learning Differs from Pure Optimization

Optimization algorithms used for training of deep models differ from traditional optimization algorithms in several ways. Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure P, that is defined with respect to the test set and may also be intractable. We therefore optimize P only indirectly. We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.

Typically, the cost function can be written as an average over the training set, such as

    J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\mathrm{data}}} L(f(x; \theta), y),        (8.1)

where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output. Throughout this chapter, we develop the unregularized supervised case, where the arguments to L are f(x; θ) and y. However, it is trivial to extend this development, for example, to include θ or x as arguments, or to exclude y as an argument, in order to develop various forms of regularization or unsupervised learning.

Eq. 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data generating distribution p_data rather than just over the finite training set:

    J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\mathrm{data}}} L(f(x; \theta), y).        (8.2)
8.1.1 Empirical Risk Minimization

The goal of a machine learning algorithm is to reduce the expected generalization error given by Eq. 8.2. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task solvable by an optimization algorithm. However, when we do not know p_data(x, y) but only have a training set of samples, we have a machine learning problem.

The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

    \mathbb{E}_{x,y \sim \hat{p}_{\mathrm{data}}(x,y)} [L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}),        (8.3)

where m is the number of training examples.

The training process based on minimizing this average training error is known as empirical risk minimization. In this setting, machine learning is still very similar to straightforward optimization. Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.

However, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible. The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.
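As a small worked illustration of Eq. 8.3, the following NumPy snippet computes the empirical risk of a linear model under squared error. The model, loss, and data are arbitrary stand-ins chosen only to make the quantity concrete.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Eq. 8.3: average per-example loss over the training set. Here L is
    squared error and f(x; theta) = x . theta, a linear model."""
    predictions = X @ theta
    per_example_loss = (predictions - y) ** 2
    return per_example_loss.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

print(empirical_risk(np.zeros(3), X, y))   # high risk for a poor theta
print(empirical_risk(true_theta, X, y))    # ~0.01, near the noise floor
```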
8.1.2 Surrogate Loss Functions and Early Stopping

Sometimes, the loss function we actually care about (say classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.

In some cases, a surrogate loss function actually results in being able to learn more. For example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by further pushing the classes apart from each other, obtaining a more confident and reliable classifier, thus extracting more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.

A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum. Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion based on early stopping (Sec. 7.8) is satisfied. Typically the early stopping criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur. Training often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.
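The following sketch contrasts the two losses on a toy binary classification problem: once every example is classified correctly the 0-1 loss is exactly zero and provides no gradient, while the negative log-likelihood surrogate remains positive and still rewards pushing the classes further apart. The scores and labels are illustrative.

```python
import numpy as np

def zero_one_loss(scores, labels):
    """0-1 loss: piecewise constant in the parameters, so its derivative is
    zero or undefined everywhere and useless for gradient descent."""
    return (np.sign(scores) != labels).mean()

def logistic_nll(scores, labels):
    """Negative log-likelihood surrogate: smooth in the scores, so it keeps
    providing a gradient even after the 0-1 loss reaches zero."""
    return np.log1p(np.exp(-labels * scores)).mean()

scores = np.array([2.0, 0.5, -1.5])   # f(x; theta) for three examples
labels = np.array([1, 1, -1])         # targets in {-1, +1}
print(zero_one_loss(scores, labels))  # 0.0: every example already correct
print(logistic_nll(scores, labels))   # > 0: still rewards larger margins
```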
8.1.3 Batch and Minibatch Algorithms

One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.

For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

    \theta_{\mathrm{ML}} = \arg\max_\theta \sum_{i=1}^{m} \log p_{\mathrm{model}}(x^{(i)}, y^{(i)}; \theta).        (8.4)

Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

    J(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{\mathrm{data}}} \log p_{\mathrm{model}}(x, y; \theta).        (8.5)

Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the most commonly used property is the gradient:

    \nabla_\theta J(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{\mathrm{data}}} \nabla_\theta \log p_{\mathrm{model}}(x, y; \theta).        (8.6)

Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.

Recall that the standard error of the mean (Eq. 5.46) estimated from n samples is given by σ/√n, where σ is the true standard deviation of the value of the samples. The denominator of √n shows that there are less than linear returns to using more examples to estimate the gradient. Compare two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former, but reduces the standard error of the mean only by a factor of 10. Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.

Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. In practice, we are unlikely to truly encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.

Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch. This terminology can be somewhat confusing because the word “batch” is also often used to describe the minibatch used by minibatch stochastic gradient descent. Typically the term “batch gradient descent” implies the use of the full training set, while the use of the term “batch” to describe a group of examples does not. For example, it is very common to use the term “batch size” to describe the size of a minibatch.

Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods. The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.

Most algorithms used for deep learning fall somewhere in between, using more than one but less than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods and it is now common to simply call them stochastic methods.

The canonical example of a stochastic method is stochastic gradient descent, presented in detail in Sec. 8.3.1.

Minibatch sizes are generally driven by the following factors:

• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

• Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.

• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.

• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Different kinds of algorithms use different kinds of information from the minibatch in different ways. Some algorithms are more sensitive to sampling error than others, either because they use information that is difficult to estimate accurately with few samples, or because they use information in ways that amplify sampling errors more. Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes like 100. Second-order methods, which also use the Hessian matrix H and compute updates such as H^{-1} g, typically require much larger batch sizes like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of H^{-1} g. Suppose that H is estimated perfectly but has a poor condition number. Multiplication by H or its inverse amplifies pre-existing errors, in this case, estimation errors in g. Very small changes in the estimate of g can thus cause large changes in the update H^{-1} g, even if H were estimated perfectly. Of course, H will be estimated only approximately, so the update H^{-1} g will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of g.
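The σ/√n scaling discussed above is easy to verify numerically. In the following sketch, synthetic values stand in for per-example gradient components; averaging over batches of 100 and of 10,000 shows the standard error shrinking by only a factor of 10 for a 100-fold increase in computation.

```python
import numpy as np

# The sigma / sqrt(n) scaling: a 100x larger batch costs 100x more
# computation but shrinks the standard error of the estimate only 10x.
rng = np.random.default_rng(0)
samples = 1.0 + rng.normal(size=1_000_000)   # true mean 1.0, sigma 1.0

for n in (100, 10_000):
    estimates = samples.reshape(-1, n).mean(axis=1)
    print(n, round(estimates.std(), 4))      # ~0.1 for n=100, ~0.01 for n=10,000
```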
It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other. Many datasets are most naturally arranged in a way where successive examples are highly correlated. For example, we might have a dataset of medical data with a long list of blood sample test results. This list might be arranged so that first we have five blood samples taken at different times from the first patient, then we have three blood samples taken from the second patient, then the blood samples from the third patient, and so on. If we were to draw examples in order from this list, then each of our minibatches would be extremely biased, because it would represent primarily one patient out of the many patients in the dataset. In cases such as these where the order of the dataset holds some significance, it is necessary to shuffle the examples before selecting minibatches. For very large datasets, for example datasets containing billions of examples in a data center, it can be impractical to sample examples truly uniformly at random every time we want to construct a minibatch. Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion. This will impose a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use, and each individual model will be forced to reuse this ordering every time it passes through the training data. However, this deviation from true random selection does not seem to have a significant detrimental effect. Failing to ever shuffle the examples in any way can seriously reduce the effectiveness of the algorithm.

Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes J(X) for one minibatch of examples X at the same time that we compute the update for several other minibatches. Such asynchronous parallel distributed approaches are discussed further in Sec. 12.1.3.
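A minimal sketch of minibatch stochastic gradient descent with shuffling follows; it builds the gradient estimate that Eq. 8.9 below formalizes. The grad_loss helper is an assumed placeholder, and for simplicity the sketch reshuffles every epoch, whereas shuffling once and storing the order is often sufficient in practice.

```python
import numpy as np

def minibatch_sgd(theta, X, y, grad_loss, lr=0.01, batch_size=32,
                  epochs=5, seed=0):
    """Minibatch SGD sketch in the spirit of Eq. 8.9. `grad_loss` is an
    assumed helper returning the average per-example gradient on a batch."""
    rng = np.random.default_rng(seed)
    m = len(X)
    for _ in range(epochs):
        order = rng.permutation(m)           # break correlations between neighbors
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            g_hat = grad_loss(theta, X[idx], y[idx])   # minibatch gradient estimate
            theta = theta - lr * g_hat       # step in the direction of -g_hat
    return theta
```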
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS

descen
descentt shuffle the dataset once and then pass through it multiple times. On the
first pass, each minibatch is used to compute an un unbiased
biased estimate of the true
descen t shuffle the dataset once and then pass
generalization error. On the second pass, the estimate becomes through it multiple times.
biased becauseOn itthe
is
first pass, each minibatch
formed by re-sampling values that havis used to compute an unbiased estimate
havee already been used, rather than obtainingof the true
generalization error. On the second
new fair samples from the data generating pass, thedistribution.
estimate becomes biased because it is
formed by re-sampling values that have already been used, rather than obtaining
The fact that sto stocchastic gradien
gradientt descent minimizes generalization error is
new fair samples from the data generating distribution.
easiest to see in the online learning case, where examples or minibatc minibatches
hes are drawn
The
from a str fact that sto c hastic gradien t descent minimizes generalization
streeam of data. In other words, instead of receiving a fixed-size training error is
easiest to see in the online learning case, where examples or minibatc
set, the learner is similar to a living being who sees a new example at each instant, hes are drawn
from every
with a streexample
am of data.
(x, y)Incoming
other words,
from theinstead of receiving
data generating a fixed-size
distribution training
p data(x, y ).
set, the learner is similar to a
In this scenario, examples are never repliving b eing who
repeated; sees a
eated; every expnew example
experience at each instant,
erience is a fair sample
with every
from p data. example (x , y ) coming from the data generating distribution p (x, y ).
In this scenario, examples are never repeated; every experience is a fair sample
The equiv alence is easiest to derive when both x and y are discrete. In this
equivalence
from p .
The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error (Eq. 8.2) can be written as a sum

    J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),    (8.7)

with the exact gradient

    g = ∇_θ J*(θ) = Σ_x Σ_y p_data(x, y) ∇_θ L(f(x; θ), y).    (8.8)

We have already seen the same fact demonstrated for the log-likelihood in Eq. 8.5 and Eq. 8.6; we observe now that this holds for other functions L besides the likelihood. A similar result can be derived when x and y are continuous, under mild assumptions regarding p_data and L.

Hence, we can obtain an unbiased estimator of the exact gradient of the generalization error by sampling a minibatch of examples {x^(1), . . . , x^(m)} with corresponding targets y^(i) from the data generating distribution p_data, and computing the gradient of the loss with respect to the parameters for that minibatch:

    ĝ = (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i)).    (8.9)

Updating θ in the direction of ĝ performs SGD on the generalization error.
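To make Eq. 8.9 concrete, the following is a minimal NumPy sketch of the estimator, using linear regression with squared loss as a stand-in for f and L. The data generating rule, function names, minibatch size and learning rate are all illustrative assumptions, not anything specified in this chapter.

    import numpy as np

    rng = np.random.default_rng(0)
    true_theta = np.array([2.0, -3.0])         # pretend data generating rule

    def sample_minibatch(m):
        # Fresh draws from p_data(x, y): examples are never reused.
        X = rng.normal(size=(m, 2))
        y = X @ true_theta + 0.1 * rng.normal(size=m)
        return X, y

    def g_hat(theta, X, y):
        # Eq. 8.9 for the linear model f(x; theta) = x . theta with
        # squared loss: (1/m) grad_theta sum_i L(f(x^(i); theta), y^(i)).
        m = X.shape[0]
        return (2.0 / m) * X.T @ (X @ theta - y)

    theta = np.zeros(2)
    epsilon = 0.1                              # learning rate
    for step in range(1000):
        X, y = sample_minibatch(m=32)
        theta -= epsilon * g_hat(theta, X, y)  # SGD step on generalization error

    print(theta)                               # approaches true_theta

Because each minibatch here is drawn fresh from the data generating distribution, every ĝ is an unbiased estimate of the gradient of the generalization error, matching the online scenario described above.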
Of course, this interpretation only applies when examples are not reused. Nonetheless, it is usually best to make several passes through the training set, unless the training set is extremely large. When multiple such epochs are used, only the first epoch follows the unbiased gradient of the generalization error, but of course, the additional epochs usually provide enough benefit due to decreased training error to offset the harm they cause by increasing the gap between training error and test error.

With some datasets growing rapidly in size, faster than computing power, it is becoming more common for machine learning applications to use each training example only once or even to make an incomplete pass through the training set. When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns. See also Bottou and Bousquet (2008) for a discussion of the effect of computational bottlenecks on generalization error, as the number of training examples grows.
8.2 Challenges in Neural Network Optimization

Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general optimization by carefully designing the objective function and constraints to ensure that the optimization problem is convex. When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimization for training deep models.
8.2.1 Ill-Conditioning

Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H. This is a very general problem in most numerical optimization, convex or otherwise, and is described in more detail in Sec. 4.3.1.
The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get "stuck" in the sense that even very small steps increase the cost function.
Recall from Eq. 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg will add

    (1/2) ε² gᵀHg − ε gᵀg    (8.10)

to the cost. Ill-conditioning of the gradient becomes a problem when (1/2) ε² gᵀHg exceeds ε gᵀg. To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm gᵀg and the gᵀHg term.
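As a rough illustration of this monitoring recipe, the sketch below tracks gᵀg and gᵀHg on a deliberately ill-conditioned quadratic, approximating the Hessian-vector product Hg with a finite difference of gradients so that H never needs to be formed explicitly. The toy objective, step size and names are all our own assumptions.

    import numpy as np

    A = np.diag([1.0, 100.0])   # ill-conditioned quadratic: J = 0.5 theta^T A theta

    def grad(theta):
        return A @ theta

    def hessian_vector_product(grad_fn, theta, v, r=1e-5):
        # H v is approximated by (grad(theta + r v) - grad(theta)) / r.
        return (grad_fn(theta + r * v) - grad_fn(theta)) / r

    theta = np.array([1.0, 1.0])
    epsilon = 0.019             # just under the stability limit 2/100
    for step in range(100):
        g = grad(theta)
        gHg = g @ hessian_vector_product(grad, theta, g)
        if step % 20 == 0:
            # Eq. 8.10: the step changes the cost by
            # 0.5 eps^2 g^T H g - eps g^T g, so learning stalls when the
            # curvature term catches up with the squared gradient norm.
            print(f"step {step:3d}: g^T g = {g @ g:.3e}, g^T H g = {gHg:.3e}")
        theta = theta - epsilon * g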
[Figure 8.1: two panels plotting gradient norm (left) and classification error rate (right) against training time in epochs.]

Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network used for object detection. (Left) A scatterplot showing how the norms of individual gradient evaluations are distributed over time. To improve legibility, only one gradient norm is plotted per epoch. The running average of all gradient norms is plotted as a solid curve. The gradient norm clearly increases over time, rather than decreasing as we would expect if the training process converged to a critical point. (Right) Despite the increasing gradient, the training process is reasonably successful. The validation set classification error decreases to a low level.
In many cases, the gradient norm does not shrink significantly throughout learning, but the gᵀHg term grows by more than an order of magnitude. The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature. Fig. 8.1 shows an example of the gradient increasing significantly during the successful training of a neural network.
Though ill-conditioning is present in other settings besides neural network training, some of the techniques used to combat it in other contexts are less applicable to neural networks. For example, Newton's method is an excellent tool for minimizing convex functions with poorly conditioned Hessian matrices, but in the subsequent sections we will argue that Newton's method requires significant modification before it can be applied to neural networks.
8.2.2 Local Minima

One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum. Any local minimum is
guaranteed to be a global minimum. Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within such a flat region is an acceptable solution. When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind.
With non-convex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local minima. However, as we will see, this is not
necessarily a major problem.

Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima because of the model identifiability problem. A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters. Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent variables with each other. For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit i with the incoming weight vector for unit j, then doing the same for the outgoing weight vectors. If we have m layers with n units each, then there are n!ᵐ ways of arranging the hidden units.
This kind of non-identifiability is known as weight space symmetry.

In addition to weight space symmetry, many kinds of neural networks have additional causes of non-identifiability. For example, in any rectified linear or maxout network, we can scale all of the incoming weights and biases of a unit by α if we also scale all of its outgoing weights by 1/α. This means that, if the cost function does not include terms such as weight decay that depend directly on the weights rather than the models' outputs, every local minimum of a rectified linear or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
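The two symmetries just described are easy to verify numerically. The sketch below is our own construction, not something from the text: it builds a small rectified linear network and checks that permuting two hidden units, or rescaling one unit's incoming weights and bias by α while dividing its outgoing weights by α, leaves the network function unchanged.

    import numpy as np

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1: 3 -> 4
    W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)  # layer 2: 4 -> 1

    def net(x, W1, b1, W2, b2):
        h = np.maximum(0.0, W1 @ x + b1)   # rectified linear hidden layer
        return W2 @ h + b2

    x = rng.normal(size=3)
    reference = net(x, W1, b1, W2, b2)

    # Weight space symmetry: swap hidden units 0 and 1 everywhere.
    perm = [1, 0, 2, 3]
    swapped = net(x, W1[perm], b1[perm], W2[:, perm], b2)

    # Scaling non-identifiability: scale unit 2's incoming weights and
    # bias by alpha and its outgoing weights by 1/alpha. The rectifier is
    # positively homogeneous, so this requires alpha > 0.
    alpha = 7.3
    W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
    W1s[2] *= alpha
    b1s[2] *= alpha
    W2s[:, 2] /= alpha
    scaled = net(x, W1s, b1s, W2s, b2)

    print(np.allclose(reference, swapped))  # True
    print(np.allclose(reference, scaled))   # True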
These model identifiability issues mean that there can be an extremely large or even uncountably infinite amount of local minima in a neural network cost function. However, all of these local minima arising from non-identifiability are equivalent to each other in cost function value. As a result, these local minima are not a problematic form of non-convexity.

Local minima can be problematic if they have high cost in comparison to the
global minimum. One can construct small neural networks, even without hidden units, that have local minima with higher cost than the global minimum (Sontag and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992). If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms.

It remains an open question whether there are many local minima of high cost
for networks of practical interest and whether optimization algorithms encounter them. For many years, most practitioners believed that local minima were a common problem plaguing neural network optimization. Today, that does not appear to be the case. The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than to find a point in parameter space that has low but not minimal cost (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014).
Many practitioners attribute nearly all difficulty with neural network optimization to local minima. We encourage practitioners to carefully test for specific problems. A test that can rule out local minima as the problem is to plot the norm of the gradient over time. If the norm of the gradient does not shrink to insignificant size, the problem is neither local minima nor any other kind of critical point. This kind of negative test can rule out local minima. In high dimensional spaces, it can be very difficult to positively establish that local minima are the problem. Many structures other than local minima also have small gradients.
8.2.3 Plateaus, Saddle Points and Other Flat Regions

For many high-dimensional non-convex functions, local minima (and maxima) are in fact rare compared to another kind of point with zero gradient: a saddle point. Some points around a saddle point have greater cost than the saddle point, while others have lower cost. At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated with positive eigenvalues have greater cost than the saddle point, while points lying along negative eigenvalues have lower value. We can think of a saddle point as being a local
maximum along another cross-section. See Fig. 4.5 for an illustration.

Many classes of random functions exhibit the following behavior: in low-dimensional spaces, local minima are common. In higher dimensional spaces, local minima are rare and saddle points are more common. For a function f : ℝⁿ → ℝ of this type, the expected ratio of the number of saddle points to local minima grows exponentially with n. To understand the intuition behind this behavior, observe that the Hessian matrix at a local minimum has only positive eigenvalues. The Hessian matrix at a saddle point has a mixture of positive and negative eigenvalues. Imagine that the sign of each eigenvalue is generated by flipping a coin. In a single dimension, it is easy to obtain a local minimum by tossing a coin and getting heads once. In n-dimensional space, it is exponentially unlikely that all n coin tosses will be heads. See Dauphin et al. (2014) for a review of the relevant theoretical work.
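The coin-flip intuition can be illustrated with random symmetric matrices standing in for Hessians. The experiment below is our own illustration under these assumptions, not one from the text: the fraction of draws whose eigenvalues are all positive collapses as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(2)

    def fraction_all_positive(n, trials=20000):
        # Draw random symmetric matrices (stand-ins for Hessians at
        # critical points) and count how often every eigenvalue is
        # positive, i.e. how often the point would be a local minimum.
        hits = 0
        for _ in range(trials):
            A = rng.normal(size=(n, n))
            H = (A + A.T) / 2.0
            if np.all(np.linalg.eigvalsh(H) > 0):
                hits += 1
        return hits / trials

    for n in [1, 2, 3, 4, 5]:
        print(n, fraction_all_positive(n))
    # The fraction collapses rapidly with n: among critical points of
    # such random functions, saddle points vastly outnumber minima.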
An amazing property of many random functions is that the eigenvalues of the Hessian become more likely to be positive as we reach regions of lower cost. In our coin tossing analogy, this means we are more likely to have our coin come up heads n times if we are at a critical point with low cost. This means that local minima are much more likely to have low cost than high cost. Critical points with high cost are far more likely to be saddle points. Critical points with extremely high cost are more likely to be local maxima.

This happens for many classes of random functions. Does it happen for neural networks? Baldi and Hornik (1989) showed theoretically that shallow autoencoders (feedforward networks trained to copy their input to their output, described in Chapter 14) with no nonlinearities have global minima and saddle points but no local minima with higher cost than the global minimum. They observed without proof that these results extend to deeper networks without nonlinearities. The output of such networks is a linear function of their input, but they are useful to study as a model of nonlinear neural networks because their loss function is a non-convex function of their parameters. Such networks are essentially just multiple matrices composed together. Saxe et al. (2013) provided exact solutions to the complete learning dynamics in such networks and showed that learning in these models captures many of the qualitative features observed in the training of deep models with nonlinear activation functions. Dauphin et al. (2014) showed experimentally that real neural networks also have loss functions that contain very many high-cost saddle points. Choromanska et al. (2014) provided additional theoretical arguments, showing that another class of high-dimensional random functions related to neural networks does so as well.

What are the implications of the proliferation of saddle points for training algorithms?
For first-order optimization algorithms that use only gradient information, the situation is unclear. The gradient can often become very small near a saddle point. On the other hand, gradient descent empirically seems to be able to escape saddle points in many cases. Goodfellow et al. (2015) provided visualizations of several learning trajectories of state-of-the-art neural networks, with an example given in Fig. 8.2. These visualizations show a flattening of the cost function near a prominent saddle point where the weights are all zero, but they also show the gradient descent trajectory rapidly escaping this region. Goodfellow et al. (2015) also argue that continuous-time gradient descent may be shown analytically to be repelled from, rather than attracted to, a nearby saddle point, but the situation may be different for more realistic uses of gradient descent.

For Newton's method, it is clear that saddle points constitute a problem.
[Figure 8.2: a cost surface J(θ) plotted over two projections of θ.]

Figure 8.2: A visualization of the cost function of a neural network. Image adapted with permission from Goodfellow et al. (2015). These visualizations appear similar for feedforward neural networks, convolutional networks, and recurrent networks applied to real object recognition and natural language processing tasks. Surprisingly, these visualizations usually do not show many conspicuous obstacles. Prior to the success of stochastic gradient descent for training very large models beginning in roughly 2012, neural net cost function surfaces were generally believed to have much more non-convex structure than is revealed by these projections. The primary obstacle revealed by this projection is a saddle point of high cost near where the parameters are initialized, but, as indicated by the blue path, the SGD training trajectory escapes this saddle point readily. Most of training time is spent traversing the relatively flat valley of the cost function, which may be due to high noise in the gradient, poor conditioning of the Hessian matrix in this region, or simply the need to circumnavigate the tall "mountain" visible in the figure via an indirect arcing path.
Gradient descent is designed to move "downhill" and is not explicitly designed to seek a critical point. Newton's method, however, is designed to solve for a point where the gradient is zero. Without appropriate modification, it can jump to a saddle point. The proliferation of saddle points in high dimensional spaces presumably explains why second-order methods have not succeeded in replacing gradient descent for neural network training. Dauphin et al. (2014) introduced a saddle-free Newton method for second-order optimization and showed that it improves significantly over the traditional version. Second-order methods remain difficult to scale to large neural networks, but this saddle-free approach holds promise if it could be scaled.
There are other kinds of points with zero gradient besides minima and saddle points. There are also maxima, which are much like saddle points from the perspective of optimization: many algorithms are not attracted to them, but unmodified Newton's method is. Maxima become exponentially rare in high dimensional space, just like minima do.
There may also be wide, flat regions of constant value. In these locations, the gradient and also the Hessian are all zero. Such degenerate locations pose major problems for all numerical optimization algorithms. In a convex problem, a wide, flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the objective function.
8.2.4 Cliffs and Exploding Gradients

Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in Fig. 8.3. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff structure altogether.
[Figure 8.3: a cost surface J(w; b) over parameters w and b, showing a cliff.]

Figure 8.3: The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that had been done. Figure adapted with permission from Pascanu et al. (2013a).
The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences can be avoided using the gradient clipping heuristic described in Sec. 10.11.1. The basic idea is to recall that the gradient does not specify the optimal step size, but only the optimal direction within an infinitesimal region. When the traditional gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size to be small enough that it is less likely to go outside the region where the gradient indicates the direction of approximately steepest descent. Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve a multiplication of many factors, with one factor for each time step. Long temporal sequences thus incur an extreme amount of multiplication.
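A minimal sketch of clipping in this spirit follows; the threshold and the names are arbitrary illustrative choices, and the actual heuristic is described in Sec. 10.11.1.

    import numpy as np

    def clip_gradient(g, threshold=1.0):
        # If the proposed step is too large, shrink it to the trusted
        # radius while preserving its direction.
        norm = np.linalg.norm(g)
        if norm > threshold:
            g = g * (threshold / norm)
        return g

    # Near a cliff the raw gradient can be enormous; clipping keeps the
    # update from catapulting the parameters off the cliff structure.
    g = np.array([250.0, -40.0])
    print(clip_gradient(g))   # same direction, norm reduced to 1.0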
8.2.5 Long-Term Dependencies

Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs. So do recurrent networks, described in Chapter 10, which construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties.
For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by Wᵗ. Suppose that W has an eigendecomposition W = V diag(λ) V⁻¹. In this simple case, it is straightforward to see that

    Wᵗ = (V diag(λ) V⁻¹)ᵗ = V diag(λ)ᵗ V⁻¹.    (8.11)
Any eigenvalues λᵢ that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude. The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to diag(λ)ᵗ. Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.

The repeated multiplication by W at each time step described here is very similar to the power method algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this point of view it is not surprising that xᵀWᵗ will eventually discard all components of x that are orthogonal to the principal eigenvector of W.
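Eq. 8.11 is easy to see numerically. In the sketch below we pick an arbitrary invertible V and eigenvalues 1.1 and 0.9 (illustrative values, not from the text): the component along the eigenvalue greater than 1 explodes while the other vanishes, and Wᵗx lines up with the principal eigenvector, just as the power method would predict.

    import numpy as np

    V = np.array([[1.0, 1.0],
                  [0.0, 1.0]])                 # arbitrary invertible V
    lam = np.array([1.1, 0.9])                 # eigenvalues of W
    W = V @ np.diag(lam) @ np.linalg.inv(V)    # W = V diag(lam) V^-1

    x = np.array([1.0, 1.0])
    for t in [1, 10, 50, 100]:
        xt = np.linalg.matrix_power(W, t) @ x
        print(t, xt, lam ** t)                 # components scale as lam_i^t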
Recurrent networks use the same matrix W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014).

We defer a further discussion of the challenges of training recurrent networks until Sec. 10.7, after recurrent networks have been described in more detail.
8.2.6 Inexact Gradients

Most optimization algorithms are primarily motivated by the case where we have exact knowledge of the gradient or Hessian matrix. In practice, we usually only have a noisy or even biased estimate of these quantities. Nearly every deep learning algorithm relies on sampling-based estimates, at least insofar as using a minibatch of training examples to compute the gradient.
In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise with the more advanced models in Part III. For example, contrastive divergence gives a technique for approximating the gradient of the intractable log-likelihood of a Boltzmann machine.

Various neural network optimization algorithms are designed to account for imperfections in the gradient estimate. One can also avoid the problem by choosing a surrogate loss function that is easier to approximate than the true loss.
8.2.7 Poor Correspondence between Local and Global Structure

Many of the problems we have discussed so far correspond to properties of the loss function at a single point: it can be difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle point hiding the opportunity to make progress downhill from the gradient.
It is possible to overcome all of these problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.

Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution. Fig. 8.2 shows that the learning trajectory spends most of its time tracing out a wide arc around a
mountain-shaped structure.

Much of research into the difficulties of optimization has focused on whether training arrives at a global minimum, a local minimum, or a saddle point, but in practice neural networks do not arrive at a critical point of any kind. Fig. 8.1 shows that neural networks often do not arrive at a region of small gradient. Indeed, such critical points do not even necessarily exist. For example, the loss function −log p(y | x; θ) can lack a global minimum point and instead asymptotically approach some value as the model becomes more confident. For a classifier with discrete y and p(y | x) provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero. Likewise, a model of real values p(y | x) = N(y; f(θ), β⁻¹) can have negative log-likelihood that asymptotes to negative infinity: if f(θ) is able to correctly predict the value of all training set y targets, the learning algorithm will increase β without bound. See Fig. 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any local minima or saddle points.
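A quick numeric check of the softmax case (a two-class illustration of our own, not the book's): the negative log-likelihood of a correctly classified example decays toward zero as the logit margin grows but never reaches it.

    import numpy as np

    def nll_correct_class(logit_margin):
        # Two-class softmax with the correct class ahead by logit_margin:
        # -log p(correct) = log(1 + exp(-margin)) > 0 for any finite margin.
        return np.log1p(np.exp(-logit_margin))

    for margin in [1.0, 5.0, 20.0, 100.0]:
        print(margin, nll_correct_class(margin))
    # The loss keeps shrinking toward 0 without attaining it, so gradient
    # descent can keep increasing the margin forever: no global minimum.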
[Figure 8.4: a one-dimensional cost function J(θ).]

Figure 8.4: Optimization based on local downhill moves can fail if the local surface does not point toward the global solution. Here we provide an example of how this can occur, even if there are no saddle points and no local minima. This example cost function contains only asymptotes toward low values, not minima. The main cause of difficulty in this case is being initialized on the wrong side of the "mountain" and not being able to traverse it. In higher dimensional space, learning algorithms can often circumnavigate such mountains but the trajectory associated with doing so may be long and result in excessive training time, as illustrated in Fig. 8.2.

Future research will need to develop further understanding of the factors that influence the length of the learning trajectory and better characterize the outcome
of the process.

Many existing research directions are aimed at finding good initial points for problems that have difficult global structure, rather than developing algorithms that use non-local moves.

Gradient descent and essentially all learning algorithms that are effective for training neural networks are based on making small, local moves. The previous sections have primarily focused on how the correct direction of these local moves can be difficult to compute. We may be able to compute some properties of the objective function, such as its gradient, only approximately, with bias or variance in our estimate of the correct direction. In these cases, local descent may or may not define a reasonably short path to a valid solution, but we are not actually able to follow the local descent path. The objective function may have issues such as poor conditioning or discontinuous gradients, causing the region where the gradient provides a good model of the objective function to be very small. In these cases, local descent with steps of size ε may define a reasonably short path to the solution, but we are only able to compute the local descent direction with steps of size δ ≪ ε. In these cases, local descent may or may not define a path to the solution, but the path contains many steps, so following the path incurs a
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS

high computational cost. Sometimes lo local


cal information provides us no guide, when
the function has a wide flat region, or if we manage to land exactly on a critical
high
p ointtcomputational
oin (usually this lattercost. Sometimes
scenario onlylocal
happinformation
happens ens to metho provides
methods ds thatus solve
no guide, when
explicitly
the critical
for functionpoints,
has a such
wide asflatNewton’s
region, ormetho
if wed).
method). manage
In theseto land
cases,exactly
lo
local on a critical
cal descent do
does
es
p oin t (usually this latter scenario only happ
not define a path to a solution at all. In other cases, lo ens to metho
local ds
cal mov
movesthat solve
es can be to explicitly
tooo greedy
for critical
and lead uspalong
oints, asuchpathasthat
Newton’s
mov
moveses metho
downhilld). but
In these
awa
way y cases, localsolution,
from any descent as doin
es
not define
Fig. 8.4, ora along
path to ana unnecessarily
solution at all.long
In other
tra cases, to
trajectory
jectory lo cal
themov es can as
solution, beintooFig.
greedy
8.2.
and
Currenlead
Currently tlyus along a path that mov es downhill but a wa y
tly,, we do not understand which of these problems are most relev from any solution,
relevan as
an in
antt to
Fig. 8.4 , or along
making neural netw an
network unnecessarily long tra jectory to the solution,
ork optimization difficult, and this is an active area of researc as in Fig. 8.2
research.h..
Currently, we do not understand which of these problems are most relevant to
Regardless
making neural ofnetwwhic
which
orkh optimization
of these problems are most
difficult, significant,
and this all ofarea
is an active them of might
researcbh.
e
avoided if there exists a region of space connected reasonably directly to a solution
by aRegardless
path thatoflo whic
local h of these
cal descen
descent problems
t can follow, are
andmost
if wesignificant,
are able to allinitialize
of them might
learningbe
av oided if there exists
within that well-behav
well-behaved a region of space connected reasonably directly
ed region. This last view suggests research into choosing to a solution
b
goyoad path
goo initialthat
poinlotscal
oints fordescen t can optimization
traditional follow, and ifalgorithms
we are able toto initialize learning
use.
within that well-behaved region. This last view suggests research into choosing
good initial points for traditional optimization algorithms to use.
8.2.8 Theoretical Limits of Optimization
Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on the use of neural networks in practice.

Some theoretical results apply only to the case where the units of a neural network output discrete values. However, most neural network units output smoothly increasing values that make optimization via local search feasible. Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to tell whether a particular problem falls into that class. Other results show that finding a solution for a network of a given size is intractable, but in practice we can find a solution easily by using a larger network for which many more parameter settings correspond to an acceptable solution. Moreover, in the context of neural network training, we usually do not care about finding the exact minimum of a function, but only about reducing its value sufficiently to obtain good generalization error. Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. Developing more realistic bounds on the performance of optimization algorithms therefore remains an important goal for machine learning research.
8.3 Basic Algorithms
We have previously introduced the gradient descent (Sec. 4.3) algorithm that follows the gradient of an entire training set downhill. This may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill, as discussed in Sec. 5.9 and Sec. 8.1.3.
8.3.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. As discussed in Sec. 8.1.3, it is possible to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data generating distribution.

Algorithm 8.1 shows how to follow this estimate of the gradient downhill.

Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ε_k.
Require: Initial parameter θ.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Apply update: θ ← θ − ε_k ĝ
end while
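To make the pseudocode concrete, here is a minimal Python sketch of the update loop of Algorithm 8.1. The helpers sample_minibatch and grad_loss are hypothetical stand-ins (not from the text) for the data pipeline and for the average minibatch gradient, and a fixed iteration count stands in for the stopping criterion:

def sgd(theta, sample_minibatch, grad_loss, epsilon=0.01, n_steps=1000):
    # Plain SGD following Algorithm 8.1. `sample_minibatch` and `grad_loss`
    # are hypothetical callables supplied by the user.
    for k in range(n_steps):              # stand-in for the stopping criterion
        X, y = sample_minibatch()         # m examples with corresponding targets
        g_hat = grad_loss(theta, X, y)    # unbiased minibatch gradient estimate
        theta = theta - epsilon * g_hat   # apply update
    return theta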
A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as ε_k.

This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum. By comparison, the true gradient of the total cost function becomes small and then 0 when we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate. A sufficient condition to guarantee convergence of SGD is that

    ∑_{k=1}^∞ ε_k = ∞, and    (8.12)

    ∑_{k=1}^∞ ε_k² < ∞.    (8.13)
In practice, it is common to decay the learning rate linearly until iteration τ:

    ε_k = (1 − α) ε_0 + α ε_τ    (8.14)

with α = k/τ. After iteration τ, it is common to leave ε constant.

The learning rate may be chosen by trial and error, but it is usually best to choose it by monitoring learning curves that plot the objective function as a function of time. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism. When using the linear schedule, the parameters to choose are ε_0, ε_τ, and τ. Usually τ may be set to the number of iterations required to make a few hundred passes through the training set. Usually ε_τ should be set to roughly 1% the value of ε_0. The main question is how to set ε_0. If it is too large, the learning curve will show violent oscillations, with the cost function often increasing significantly. Gentle oscillations are fine, especially if training with a stochastic cost function such as the cost function arising from the use of dropout. If the learning rate is too low, learning proceeds slowly, and if the initial learning rate is too low, learning may become stuck with a high cost value. Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so. Therefore, it is usually best to monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.
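The linear schedule of Eq. 8.14 is easy to state in code. In this sketch the constants are illustrative only, chosen to follow the guidance above (ε_τ roughly 1% of ε_0):

def linear_learning_rate(k, eps_0=0.1, eps_tau=0.001, tau=10000):
    # Eq. 8.14: decay linearly until iteration tau, then hold constant.
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps_0 + alpha * eps_tau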
The most important property of SGD and related minibatch or online gradient-based optimization is that computation time per update does not grow with the number of training examples. This allows convergence even when the number of training examples becomes very large. For a large enough dataset, SGD may converge to within some fixed tolerance of its final test set error before it has processed the entire training set.
To study the convergence rate of an optimization algorithm it is common to measure the excess error J(θ) − min_θ J(θ), which is the amount by which the current cost function exceeds the minimum possible cost. When SGD is applied to a convex problem, the excess error is O(1/√k) after k iterations, while in the strongly convex case it is O(1/k). These bounds cannot be improved unless extra conditions are assumed. Batch gradient descent enjoys better convergence rates than stochastic gradient descent in theory. However, the Cramér-Rao bound (Cramér, 1946; Rao, 1945) states that generalization error cannot decrease faster than O(1/k). Bottou and Bousquet (2008) argue that it therefore may not be worthwhile to pursue an optimization algorithm that converges faster than O(1/k) for machine learning tasks—faster convergence presumably corresponds to overfitting. Moreover, the asymptotic analysis obscures many advantages that stochastic gradient descent has after a small number of steps. With large datasets, the ability of SGD to make rapid initial progress while evaluating the gradient for only very few examples outweighs its slow asymptotic convergence. Most of the algorithms described in the remainder of this chapter achieve benefits that matter in practice but are lost in the constant factors obscured by the O(1/k) asymptotic analysis. One can also trade off the benefits of both batch and stochastic gradient descent by gradually increasing the minibatch size during the course of learning (see the sketch below).

For more information on SGD, see Bottou (1998).
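One simple way to realize the batch-size trade-off mentioned above is a minibatch-size schedule. The sketch below, with arbitrary example sizes, interpolates linearly from a small "stochastic" size toward a large "batch" size over training:

def minibatch_size(k, n_steps, m_start=16, m_end=4096):
    # Gradually increase the minibatch size during the course of learning,
    # moving from the stochastic regime toward the batch regime.
    frac = min(k / n_steps, 1.0)
    return int(round(m_start + frac * (m_end - m_start)))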
8.3.2 Momentum

While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. The effect of momentum is illustrated in Fig. 8.5.

Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton’s laws of motion. Momentum in physics is mass times velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay. The update rule is given by:

    v ← αv − ε ∇_θ ((1/m) ∑_{i=1}^m L(f(x^(i); θ), y^(i))),    (8.15)

    θ ← θ + v.    (8.16)

The velocity v accumulates the gradient elements ∇_θ ((1/m) ∑_{i=1}^m L(f(x^(i); θ), y^(i))). The larger α is relative to ε, the more previous gradients affect the current direction. The SGD algorithm with momentum is given in Algorithm 8.2.
Figure 8.5: Momentum aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. Here, we illustrate how momentum overcomes the first of these two problems. The contour lines depict a quadratic loss function with a poorly conditioned Hessian matrix. The red path cutting across the contours indicates the path followed by the momentum learning rule as it minimizes this function. At each step along the way, we draw an arrow indicating the step that gradient descent would take at that point. We can see that a poorly conditioned quadratic objective looks like a long, narrow valley or canyon with steep sides. Momentum correctly traverses the canyon lengthwise, while gradient steps waste time moving back and forth across the narrow axis of the canyon. Compare also Fig. 4.6, which shows the behavior of gradient descent without momentum.
Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate. Now, the size of the step depends on how large and how aligned a sequence of gradients are. The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g, then it will accelerate in the direction of −g, until reaching a terminal velocity where the size of each step is

    ε‖g‖ / (1 − α).    (8.17)

It is thus helpful to think of the momentum hyperparameter in terms of 1/(1 − α). For example, α = .9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.

Common values of α used in practice include .5, .9, and .99. Like the learning rate, α may also be adapted over time. Typically it begins with a small value and is later raised. It is less important to adapt α over time than to shrink ε over time.

Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
end while
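In code, momentum adds one state variable to the SGD sketch given earlier. As before, sample_minibatch and grad_loss are hypothetical helpers and the constants are illustrative:

import numpy as np

def sgd_momentum(theta, sample_minibatch, grad_loss,
                 epsilon=0.01, alpha=0.9, n_steps=1000):
    # SGD with momentum, following Algorithm 8.2.
    v = np.zeros_like(theta)             # initial velocity
    for k in range(n_steps):
        X, y = sample_minibatch()
        g = grad_loss(theta, X, y)       # minibatch gradient estimate
        v = alpha * v - epsilon * g      # exponentially decaying gradient average
        theta = theta + v                # move along the velocity
    return theta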
We can view the momentum algorithm as simulating a particle subject to continuous-time Newtonian dynamics. The physical analogy can help to build intuition for how the momentum and gradient descent algorithms behave.

The position of the particle at any point in time is given by θ(t). The particle experiences net force f(t). This force causes the particle to accelerate:

    f(t) = (∂²/∂t²) θ(t).    (8.18)

Rather than viewing this as a second-order differential equation of the position, we can introduce the variable v(t) representing the velocity of the particle at time t and rewrite the Newtonian dynamics as a first-order differential equation:

    v(t) = (∂/∂t) θ(t),    (8.19)

    f(t) = (∂/∂t) v(t).    (8.20)

The momentum algorithm then consists of solving the differential equations via numerical simulation. A simple numerical method for solving differential equations is Euler’s method, which simply consists of simulating the dynamics defined by the equation by taking small, finite steps in the direction of each gradient.

This explains the basic form of the momentum update, but what specifically are the forces? One force is proportional to the negative gradient of the cost function: −∇_θ J(θ). This force pushes the particle downhill along the cost function surface. The gradient descent algorithm would simply take a single step based on each gradient, but the Newtonian scenario used by the momentum algorithm instead uses this force to alter the velocity of the particle. We can think of the particle as being like a hockey puck sliding down an icy surface. Whenever it descends a steep part of the surface, it gathers speed and continues sliding in that direction until it begins to go uphill again.

One other force is necessary. If the only force is the gradient of the cost function, then the particle might never come to rest. Imagine a hockey puck sliding down one side of a valley and straight up the other side, oscillating back and forth forever, assuming the ice is perfectly frictionless. To resolve this problem, we add one other force, proportional to −v(t). In physics terminology, this force corresponds to viscous drag, as if the particle must push through a resistant medium such as syrup. This causes the particle to gradually lose energy over time and eventually converge to a local minimum.

Why do we use −v(t) and viscous drag in particular? Part of the reason to use −v(t) is mathematical convenience—an integer power of the velocity is easy to work with. However, other physical systems have other kinds of drag based on other integer powers of the velocity. For example, a particle traveling through the air experiences turbulent drag, with force proportional to the square of the velocity, while a particle moving along the ground experiences dry friction, with a force of constant magnitude. We can reject each of these options. Turbulent drag, proportional to the square of the velocity, becomes very weak when the velocity is small. It is not powerful enough to force the particle to come to rest. A particle with a non-zero initial velocity that experiences only the force of turbulent drag will move away from its initial position forever, with the distance from the starting point growing like O(log t). We must therefore use a lower power of the velocity. If we use a power of zero, representing dry friction, then the force is too strong. When the force due to the gradient of the cost function is small but non-zero, the constant force due to friction can cause the particle to come to rest before reaching a local minimum. Viscous drag avoids both of these problems—it is weak enough that the gradient can continue to cause motion until a minimum is reached, but strong enough to prevent motion if the gradient does not justify moving.
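To connect the continuous-time picture back to the algorithm, the following sketch integrates Eqs. 8.19 and 8.20 with Euler's method, using the gradient force −∇_θ J(θ) plus a viscous drag force proportional to −v(t). The quadratic cost, step size, and drag coefficient are illustrative assumptions, not values from the text:

import numpy as np

# Illustrative poorly conditioned quadratic cost J(theta) = 0.5 theta^T A theta,
# whose gradient is A theta.
A = np.diag([1.0, 100.0])

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
dt, drag = 0.01, 5.0                     # Euler step size and drag coefficient

for _ in range(2000):
    force = -A @ theta - drag * v        # gradient force plus viscous drag
    v = v + dt * force                   # Euler step for Eq. 8.20
    theta = theta + dt * v               # Euler step for Eq. 8.19
# theta now lies near the minimum at the origin; without the drag term the
# particle would oscillate forever, as in the frictionless hockey puck analogy.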
8.3.3 Nesterov Momentum

Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov’s accelerated gradient method (Nesterov, 1983, 2004). The update rules in this case are given by:

    v ← αv − ε ∇_θ [(1/m) ∑_{i=1}^m L(f(x^(i); θ + αv), y^(i))],    (8.21)

    θ ← θ + v,    (8.22)

where the parameters α and ε play a similar role as in the standard momentum method. The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum. The complete Nesterov momentum algorithm is presented in Algorithm 8.3.

In the convex batch gradient case, Nesterov momentum brings the rate of convergence of the excess error from O(1/k) (after k steps) to O(1/k²) as shown by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence.

Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding labels y^(i).
    Apply interim update: θ̃ ← θ + αv
    Compute gradient (at interim point): g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
end while
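In code, the only change relative to the momentum sketch given earlier is that the gradient is evaluated at the interim point θ + αv. The helpers remain hypothetical:

import numpy as np

def sgd_nesterov(theta, sample_minibatch, grad_loss,
                 epsilon=0.01, alpha=0.9, n_steps=1000):
    # SGD with Nesterov momentum, following Algorithm 8.3.
    v = np.zeros_like(theta)
    for k in range(n_steps):
        X, y = sample_minibatch()
        theta_interim = theta + alpha * v     # apply interim update
        g = grad_loss(theta_interim, X, y)    # gradient at the interim point
        v = alpha * v - epsilon * g           # velocity update
        theta = theta + v                     # parameter update
    return theta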
8.4 Parameter Initialization Strategies
Some optimization algorithms are not iterative by nature and simply solve for a solution point. Other optimization algorithms are iterative by nature but, when applied to the right class of optimization problems, converge to acceptable solutions in an acceptable amount of time regardless of initialization. Deep learning training algorithms usually do not have either of these luxuries. Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization. The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether. When learning does converge, the initial point can determine how quickly learning converges and whether it converges to a point with high or low cost. Also, points of comparable cost can have wildly varying generalization error, and the initial point can affect the generalization as well.

Modern initialization strategies are simple and heuristic. Designing improved initialization strategies is a difficult task because neural network optimization is not yet well understood. Most initialization strategies are based on achieving some nice properties when the network is initialized. However, we do not have a good understanding of which of these properties are preserved under which circumstances after learning begins to proceed. A further difficulty is that some initial points may be beneficial from the viewpoint of optimization but detrimental from the viewpoint of generalization. Our understanding of how the initial point affects generalization is especially primitive, offering little to no guidance for how to select the initial point.

Perhaps the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units. If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way. Even if the model or training algorithm is capable of using stochasticity to compute different updates for different units (for example, if one trains with dropout), it is usually best to initialize each unit to compute a different function from all of the other units. This may help to make sure that no input patterns are lost in the null space of forward propagation and no gradient patterns are lost in the null space of back-propagation. The goal of having each unit compute a different function motivates random initialization of the parameters. We could explicitly search for a large set of basis functions that are all mutually different from each other, but this often incurs a noticeable computational cost. For example, if we have at most as many outputs as inputs, we could use Gram-Schmidt orthogonalization on an initial weight matrix, and be guaranteed that each unit computes a very different function from each other unit. Random initialization from a high-entropy distribution over a high-dimensional space is computationally cheaper and unlikely to assign any units to compute the same function as each other.
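The symmetry problem is easy to demonstrate. In the toy sketch below (all values illustrative), two hidden units that start with identical weights produce identical activations and therefore receive identical gradients, so no sequence of deterministic updates can ever make them differ; random initialization avoids this:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                       # one illustrative input vector

W_same = np.full((2, 3), 0.1)                # two identically initialized units
print(W_same @ x)                            # identical activations, so identical
                                             # gradients: the units stay identical

W_rand = rng.normal(scale=0.1, size=(2, 3))  # random initialization
print(W_rand @ x)                            # distinct activations break symmetry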
Typically, we set the biases for each unit to heuristically chosen constants, and initialize only the weights randomly. Extra parameters, for example, parameters encoding the conditional variance of a prediction, are usually set to heuristically chosen constants much like the biases are.

We almost always initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution. The choice of Gaussian or uniform distribution does not seem to matter very much, but has not been exhaustively studied. The scale of the initial distribution, however, does have a large effect on both the outcome of the optimization procedure and on the ability of the network to generalize.

Larger initial weights will yield a stronger symmetry breaking effect, helping to avoid redundant units. They also help to avoid losing signal during forward or back-propagation through the linear component of each layer—larger values in the matrix result in larger outputs of matrix multiplication. Initial weights that are too large may, however, result in exploding values during forward propagation or back-propagation. In recurrent networks, large weights can also result in chaos (such extreme sensitivity to small perturbations of the input that the behavior of the deterministic forward propagation procedure appears random). To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step). Large weights may also result in extreme values that cause the activation function to saturate, causing complete loss of gradient through saturated units. These competing factors determine the ideal initial scale of the weights.

The perspectives of regularization and optimization can give very different insights into how we should initialize a network. The optimization perspective suggests that the weights should be large enough to propagate information successfully, but some regularization concerns encourage making them smaller. The use of an optimization algorithm such as stochastic gradient descent that makes small incremental changes to the weights and tends to halt in areas that are nearer to the initial parameters (whether due to getting stuck in a region of low gradient, or due to triggering some early stopping criterion based on overfitting) expresses a prior that the final parameters should be close to the initial parameters. Recall from Sec. 7.8 that gradient descent with early stopping is equivalent to weight decay for some models. In the general case, gradient descent with early stopping is not the same as weight decay, but does provide a loose analogy for thinking about the effect of initialization. We can think of initializing the parameters θ to θ_0 as being similar to imposing a Gaussian prior p(θ) with mean θ_0. From this point of view, it makes sense to choose θ_0 to be near 0. This prior says that it is more likely that units do not interact with each other than that they do interact. Units interact only if the likelihood term of the objective function expresses a strong preference for them to interact. On the other hand, if we initialize θ_0 to large values, then our prior specifies which units should interact with each other, and how they should interact.
Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with m inputs and n outputs by sampling each weight from U(−1/√m, 1/√m), while Glorot and Bengio (2010) suggest using the normalized initialization

    W_{i,j} ∼ U(−√(6/(m + n)), √(6/(m + n))).    (8.23)

This latter heuristic is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no nonlinearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its nonlinear counterparts.

Saxe et al. (2013) recommend initializing to random orthogonal matrices, with a carefully chosen scaling or gain factor g that accounts for the nonlinearity applied at each layer. They derive specific values of the scaling factor for different types of nonlinear activation functions. This initialization scheme is also motivated by a model of a deep network as a sequence of matrix multiplies without nonlinearities. Under such a model, this initialization scheme guarantees that the total number of training iterations required to reach convergence is independent of depth.
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS

this approach is that in feedforw feedforward ard netw


networks,
orks, activ
activations
ations and gradients can grow
or shrink on eac each h step of forw forward ard or back-propagation, follo following
wing a random walk
this
beha approach
ehavior. is that in
vior. This is because feedforwfeedforw ard
feedforward netw
ard netw orks,
networksorks use a differentgradients
activ ations and weigh can grow
eightt matrix at
or
eac
eachshrink
h lay
layer.on eac h step of forw ard or back-propagation,
er. If this random walk is tuned to preserve norms, then feedforward follo wing a random w alk
b
neteha
netw vior.
works can Thismostly
is becauseav oidfeedforw
avoid ard netw
the vanishing andorks use ding
explo
explodinga different
gradien
gradients wtseigh t matrix
problem thatat
each lay
arises er. the
when If this
same randomeightt w
weigh alk is is
matrix tuned
used to at preserve
eac
each h step,norms,
describ
described then
ed infeedforward
Sec. 8.2.5.
networks can mostly avoid the vanishing and exploding gradients problem that
Unfortunately
Unfortunately,, these optimal criteria for initial weigh weights ts often do not lead to
arises when the same weight matrix is used at each step, described in Sec. 8.2.5.
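The orthogonal scheme can be sketched as follows, assuming square weight matrices and using a QR decomposition to draw a random orthogonal matrix. The gain in the usage line is only an example (√2 is a value commonly associated with ReLU layers), not the specific constants derived by Saxe et al.

    import numpy as np

    def orthogonal_init(n, gain, rng):
        # QR-decompose a random Gaussian matrix to obtain an orthogonal Q,
        # then scale it by the gain factor g.
        A = rng.standard_normal((n, n))
        Q, R = np.linalg.qr(A)
        Q = Q * np.sign(np.diag(R))  # fix column signs so Q is uniformly distributed
        return gain * Q

    rng = np.random.default_rng(0)
    W = orthogonal_init(256, gain=np.sqrt(2.0), rng=rng)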
Unfortunately, these optimal criteria for initial weights often do not lead to optimal performance. This may be for three different reasons. First, we may be using the wrong criteria—it may not actually be beneficial to preserve the norm of a signal throughout the entire network. Second, the properties imposed at initialization may not persist after learning has begun to proceed. Third, the criteria might succeed at improving the speed of optimization but inadvertently increase generalization error. In practice, we usually need to treat the scale of the weights as a hyperparameter whose optimal value lies somewhere roughly near but not exactly equal to the theoretical predictions.
One drawback to scaling rules that set all of the initial weights to have the same standard deviation, such as 1/√m, is that every individual weight becomes extremely small when the layers become large. Martens (2010) introduced an alternative initialization scheme called sparse initialization, in which each unit is initialized to have exactly k non-zero weights. The idea is to keep the total amount of input to the unit independent from the number of inputs m without making the magnitude of individual weight elements shrink with m. Sparse initialization helps to achieve more diversity among the units at initialization time. However, it also imposes a very strong prior on the weights that are chosen to have large Gaussian values. Because it takes a long time for gradient descent to shrink “incorrect” large values, this initialization scheme can cause problems for units such as maxout units that have several filters that must be carefully coordinated with each other.
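A sketch of sparse initialization under these assumptions follows; the choice k = 15 in the usage line is illustrative (on the order of the value used by Martens), and the Gaussian scale is left at 1.

    import numpy as np

    def sparse_init(m, n, k, rng):
        # Give each of the n units exactly k non-zero incoming weights drawn
        # from a unit Gaussian; the remaining m - k entries stay zero, so the
        # total input to a unit does not depend on m.
        W = np.zeros((m, n))
        for j in range(n):
            idx = rng.choice(m, size=k, replace=False)
            W[idx, j] = rng.standard_normal(k)
        return W

    rng = np.random.default_rng(0)
    W = sparse_init(m=4096, n=512, k=15, rng=rng)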
When computational resources allow it, it is usually a good idea to treat the initial scale of the weights for each layer as a hyperparameter, and to choose these scales using a hyperparameter search algorithm described in Sec. 11.4.2, such as random search. The choice of whether to use dense or sparse initialization can also be made a hyperparameter. Alternately, one can manually search for the best initial scales. A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data. If the weights are too small, the range of activations across the minibatch will shrink as the activations propagate forward through the network. By repeatedly identifying the first layer with unacceptably small activations and increasing its weights, it is possible to eventually obtain a network with reasonable initial activations throughout. If learning is still too slow at this point, it can be useful to look at the range or standard deviation of the gradients as well as the activations. This procedure can in principle be automated and is generally less computationally costly than hyperparameter optimization based on validation set error, because it is based on feedback from the behavior of the initial model on a single batch of data, rather than on feedback from a trained model on the validation set. While long used heuristically, this protocol has recently been specified more formally and studied by Mishkin and Matas (2015).
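A crude, automatable version of this rule of thumb might look as follows. The tanh forward pass stands in for whatever network is actually in use, the tolerance and round limit are arbitrary, and rescaling a layer by 1/s pushes its activation standard deviation toward 1 on the chosen minibatch.

    import numpy as np

    def rescale_initial_weights(weights, x, tol=0.1, max_rounds=100):
        # Repeatedly find the first layer whose minibatch activations have an
        # unacceptable standard deviation and rescale its weights, until all
        # layers produce activations of roughly unit scale.
        for _ in range(max_rounds):
            h, adjusted = x, False
            for i, W in enumerate(weights):
                h = np.tanh(h @ W)        # stand-in for the real forward step
                s = h.std()
                if abs(s - 1.0) > tol:
                    weights[i] = W / s     # grow (s < 1) or shrink (s > 1)
                    adjusted = True
                    break                  # recompute from the input again
            if not adjusted:
                break
        return weights

    rng = np.random.default_rng(0)
    weights = [0.01 * rng.standard_normal((100, 100)) for _ in range(5)]
    x = rng.standard_normal((64, 100))     # a single minibatch of data
    weights = rescale_initial_weights(weights, x)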
So far we have focused on the initialization of the weights. Fortunately, initialization of other parameters is typically easier.

The approach for setting the biases must be coordinated with the approach for setting the weights. Setting the biases to zero is compatible with most weight initialization schemes. There are a few situations where we may set some biases to non-zero values:

• If a bias is for an output unit, then it is often beneficial to initialize the bias to obtain the right marginal statistics of the output. To do this, we assume that the initial weights are small enough that the output of the unit is determined only by the bias. This justifies setting the bias to the inverse of the activation function applied to the marginal statistics of the output in the training set. For example, if the output is a distribution over classes and this distribution is a highly skewed distribution with the marginal probability of class i given by element ci of some vector c, then we can set the bias vector b by solving the equation softmax(b) = c. This applies not only to classifiers but also to models we will encounter in Part III, such as autoencoders and Boltzmann machines. These models have layers whose output should resemble the input data x, and it can be very helpful to initialize the biases of such layers to match the marginal distribution over x.

• Sometimes we may want to choose the bias to avoid causing too much saturation at initialization. For example, we may set the bias of a ReLU hidden unit to 0.1 rather than 0 to avoid saturating the ReLU at initialization. This approach is not compatible with weight initialization schemes that do not expect strong input from the biases, though. For example, it is not recommended for use with random walk initialization (Sussillo, 2014).

• Sometimes a unit controls whether other units are able to participate in a function. In such situations, we have a unit with output u and another unit h ∈ [0, 1]; we can then view h as a gate that determines whether uh ≈ 1 or uh ≈ 0. In these situations, we want to set the bias for h so that h ≈ 1 most of the time at initialization. Otherwise u does not have a chance to learn. For example, Jozefowicz et al. (2015) advocate setting the bias to 1 for the forget gate of the LSTM model, described in Sec. 10.10.

Another common type of parameter is a variance or precision parameter. For example, we can perform linear regression with a conditional variance estimate using the model

p(y | x) = N(y | w⊤x + b, 1/β)    (8.24)

where β is a precision parameter. We can usually initialize variance or precision parameters to 1 safely. Another approach is to assume the initial weights are close enough to zero that the biases may be set while ignoring the effect of the weights, then set the biases to produce the correct marginal mean of the output, and set the variance parameters to the marginal variance of the output in the training set.
Besides these simple constant or random methods of initializing model parameters, it is possible to initialize model parameters using machine learning. A common strategy discussed in Part III of this book is to initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs. One can also perform supervised training on a related task. Even performing supervised training on an unrelated task can sometimes yield an initialization that offers faster convergence than a random initialization. Some of these initialization strategies may yield faster convergence and better generalization because they encode information about the distribution in the initial parameters of the model. Others apparently perform well primarily because they set the parameters to have the right scale or set different units to compute different functions from each other.
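Returning to the bias heuristics above, here is a small consolidated sketch; all sizes and probability values are illustrative. The output bias uses the fact that b = log c solves softmax(b) = c up to an additive constant.

    import numpy as np

    # Output unit biases: match the marginal class statistics, softmax(b) = c.
    c = np.array([0.90, 0.07, 0.03])       # illustrative skewed class marginals
    b_out = np.log(c)

    # ReLU hidden unit biases: small positive value to avoid saturation at init.
    b_hidden = np.full(256, 0.1)

    # Gate biases (e.g. an LSTM forget gate): a bias of 1 keeps the sigmoid gate
    # mostly open at initialization (sigmoid(1) is roughly 0.73).
    b_gate = np.ones(256)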
8.5 Algorithms with Adaptive Learning Rates

Neural network researchers have long realized that the learning rate is reliably one of the most difficult hyperparameters to set, because it has a significant impact on model performance. As we have discussed in Sec. 4.3 and Sec. 8.2, the cost is often highly sensitive to some directions in parameter space and insensitive to others. The momentum algorithm can mitigate these issues somewhat, but does so at the expense of introducing another hyperparameter. In the face of this, it is natural to ask if there is another way. If we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate for each parameter, and automatically adapt these learning rates throughout the course of learning.
The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model parameters during training. The approach is based on a simple idea: if the partial derivative of the loss with respect to a given model parameter remains the same sign, then the learning rate should increase. If the partial derivative with respect to that parameter changes sign, then the learning rate should decrease. Of course, this kind of rule can only be applied to full batch optimization.

More recently, a number of incremental (or mini-batch-based) methods have been introduced that adapt the learning rates of model parameters. This section will briefly review a few of these algorithms.
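A minimal sketch of the sign-based delta-bar-delta rule follows. The adaptation constants are illustrative, and the full rule of Jacobs (1988) compares the gradient against an exponentially smoothed average of past gradients rather than only the previous one.

    import numpy as np

    def delta_bar_delta_step(theta, grad, prev_grad, lr, kappa=1e-3, phi=0.9):
        # Increase the per-parameter rate additively where the gradient keeps
        # its sign; decrease it multiplicatively where the sign flips.
        same_sign = np.sign(grad) == np.sign(prev_grad)
        lr = np.where(same_sign, lr + kappa, lr * phi)
        return theta - lr * grad, lr       # full-batch gradient step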
8.5.1 AdaGrad

The AdaGrad algorithm, shown in Algorithm 8.4, individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011). The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. The net effect is greater progress in the more gently sloped directions of parameter space.

In the context of convex optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found that—for training deep neural network models—the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models.
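A NumPy sketch of the AdaGrad update of Algorithm 8.4 on a single parameter vector; grad_fn is a placeholder for whatever computes the minibatch gradient.

    import numpy as np

    def adagrad(theta, grad_fn, lr=0.01, delta=1e-7, steps=1000):
        r = np.zeros_like(theta)                         # historical squared gradients
        for _ in range(steps):
            g = grad_fn(theta)
            r += g * g                                   # accumulate
            theta = theta - lr / (delta + np.sqrt(r)) * g   # element-wise scaling
        return theta

    # Illustration: an ill-conditioned quadratic, with gradient A * x.
    A = np.array([10.0, 0.1])
    x = adagrad(np.array([1.0, 1.0]), grad_fn=lambda x: A * x)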
8.5.2 RMSProp

The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure. RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
Algorithm 8.4 The AdaGrad algorithm

Require: Global learning rate ε
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^−7, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← r + g ⊙ g
  Compute update: ∆θ ← −(ε/(δ + √r)) ⊙ g. (Division and square root applied element-wise)
  Apply update: θ ← θ + ∆θ
end while
RMSProp is shown in its standard form in Algorithm 8.5 and combined with Nesterov momentum in Algorithm 8.6. Compared to AdaGrad, the use of the moving average introduces a new hyperparameter, ρ, that controls the length scale of the moving average.

Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods being employed routinely by deep learning practitioners.
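In the same style as the AdaGrad sketch above, a sketch of the standard RMSProp update of Algorithm 8.5:

    import numpy as np

    def rmsprop(theta, grad_fn, lr=0.001, rho=0.9, delta=1e-6, steps=1000):
        r = np.zeros_like(theta)                      # moving average of squared gradients
        for _ in range(steps):
            g = grad_fn(theta)
            r = rho * r + (1.0 - rho) * g * g         # exponentially weighted accumulation
            theta = theta - lr / np.sqrt(delta + r) * g
        return theta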
8.5.3 Adam

Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is presented in Algorithm 8.7. The name “Adam” derives from the phrase “adaptive moments.” In the context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and momentum with a few important distinctions. First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients. The use of momentum in combination with rescaling does not have a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin (see Algorithm 8.7).
Algorithm 8.5 The RMSProp algorithm

Require: Global learning rate ε, decay rate ρ.
Require: Initial parameter θ
Require: Small constant δ, usually 10^−6, used to stabilize division by small numbers.
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute parameter update: ∆θ = −(ε/√(δ + r)) ⊙ g. (1/√(δ + r) applied element-wise)
  Apply update: θ ← θ + ∆θ
end while
RMSProp also incorporates an estimate of the (uncentered) second-order moment, however it lacks the correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias early in training. Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
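A sketch of the Adam update of Algorithm 8.7, including both bias corrections; as above, grad_fn is a placeholder.

    import numpy as np

    def adam(theta, grad_fn, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, steps=1000):
        s = np.zeros_like(theta)                      # biased 1st moment estimate
        r = np.zeros_like(theta)                      # biased 2nd moment estimate
        for t in range(1, steps + 1):
            g = grad_fn(theta)
            s = rho1 * s + (1.0 - rho1) * g
            r = rho2 * r + (1.0 - rho2) * g * g
            s_hat = s / (1.0 - rho1 ** t)             # correct initialization bias
            r_hat = r / (1.0 - rho2 ** t)
            theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
        return theta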
8.5.4 Choosing the Right Optimization Algorithm

In this section, we discussed a series of related algorithms that each seek to address the challenge of optimizing deep models by adapting the learning rate for each model parameter. At this point, a natural question is: which algorithm should one choose?

Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.

Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).
Algorithm 8.6 RMSProp algorithm with Nesterov momentum

Require: Global learning rate ε, decay rate ρ, momentum coefficient α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
  Compute interim update: θ̃ ← θ + αv
  Compute gradient: g ← (1/m) ∇θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
  Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute velocity update: v ← αv − (ε/√r) ⊙ g. (1/√r applied element-wise)
  Apply update: θ ← θ + v
end while
8.6 Approximate Second-Order Methods

In this section we discuss the application of second-order methods to the training of deep networks. See LeCun et al. (1998a) for an earlier treatment of this subject. For simplicity of exposition, the only objective function we examine is the empirical risk:

J(θ) = E_{x,y∼p̂_data(x,y)}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)).    (8.25)

However, the methods we discuss here extend readily to more general objective functions that, for instance, include parameter regularization terms such as those discussed in Chapter 7.
8.6.1 Newton’s Method

In Sec. 4.3, we introduced second-order gradient methods. In contrast to first-order methods, second-order methods make use of second derivatives to improve optimization. The most widely used second-order method is Newton’s method. We now describe Newton’s method in more detail, with emphasis on its application to neural network training.
Algorithm 8.7 The Adam algorithm

Require: Step size ε (Suggested default: 0.001)
Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1). (Suggested defaults: 0.9 and 0.999 respectively)
Require: Small constant δ used for numerical stabilization. (Suggested default: 10^−8)
Require: Initial parameters θ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇θ Σ_i L(f(x^(i); θ), y^(i))
  t ← t + 1
  Update biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
  Update biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
  Correct bias in first moment: ŝ ← s/(1 − ρ1^t)
  Correct bias in second moment: r̂ ← r/(1 − ρ2^t)
  Compute update: ∆θ = −ε ŝ/(√r̂ + δ) (operations applied element-wise)
  Apply update: θ ← θ + ∆θ
end while
Newton’s method is an optimization scheme based on using a second-order Taylor series expansion to approximate J(θ) near some point θ0, ignoring derivatives of higher order:

J(θ) ≈ J(θ0) + (θ − θ0)⊤ ∇θ J(θ0) + (1/2)(θ − θ0)⊤ H (θ − θ0),    (8.26)

where H is the Hessian of J with respect to θ evaluated at θ0. If we then solve for the critical point of this function, we obtain the Newton parameter update rule:

θ* = θ0 − H^−1 ∇θ J(θ0).    (8.27)

Thus for a locally quadratic function (with positive definite H), by rescaling the gradient by H^−1, Newton’s method jumps directly to the minimum. If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton’s method, given in Algorithm 8.8.

For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton’s method can be applied iteratively.
Algorithm 8.8 Newton’s method with objective J(θ) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)).

Require: Initial parameter θ0
Require: Training set of m examples
while stopping criterion not met do
  Compute gradient: g ← (1/m) ∇θ Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian: H ← (1/m) ∇²θ Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian inverse: H^−1
  Compute update: ∆θ = −H^−1 g
  Apply update: θ = θ + ∆θ
end while
This implies a two-step iterative procedure. First, update or compute the inverse Hessian (i.e., by updating the quadratic approximation). Second, update the parameters according to Eq. 8.27.

In Sec. 8.2.3, we discussed how Newton’s method is appropriate only when the Hessian is positive definite. In deep learning, the surface of the objective function is typically non-convex with many features, such as saddle points, that are problematic for Newton’s method. If the eigenvalues of the Hessian are not all positive, for example, near a saddle point, then Newton’s method can actually cause updates to move in the wrong direction. This situation can be avoided by regularizing the Hessian. Common regularization strategies include adding a constant, α, along the diagonal of the Hessian. The regularized update becomes

θ* = θ0 − [H(f(θ0)) + αI]^−1 ∇θ f(θ0).    (8.28)

This regularization strategy is used in approximations to Newton’s method, such as the Levenberg–Marquardt algorithm (Levenberg, 1944; Marquardt, 1963), and works fairly well as long as the negative eigenvalues of the Hessian are still relatively close to zero. In cases where there are more extreme directions of curvature, the value of α would have to be sufficiently large to offset the negative eigenvalues. However, as α increases in size, the Hessian becomes dominated by the αI diagonal and the direction chosen by Newton’s method converges to the standard gradient divided by α. When strong negative curvature is present, α may need to be so large that Newton’s method would make smaller steps than gradient descent with a properly chosen learning rate.

Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton’s method for training large neural networks is limited by the significant computational burden it imposes.
The number of elements in the Hessian is squared in the number of parameters, so with k parameters (and for even very small neural networks the number of parameters k can be in the millions), Newton’s method would require the inversion of a k × k matrix—with computational complexity of O(k³). Also, since the parameters will change with every update, the inverse Hessian has to be computed at every training iteration. As a consequence, only networks with a very small number of parameters can be practically trained via Newton’s method. In the remainder of this section, we will discuss alternatives that attempt to gain some of the advantages of Newton’s method while side-stepping the computational hurdles.
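For a problem small enough to form the Hessian explicitly, one regularized Newton step (Eq. 8.28) can be sketched as below. grad_fn and hess_fn are placeholders for the model’s derivatives, and the linear system is solved directly rather than forming H^−1.

    import numpy as np

    def regularized_newton_step(theta, grad_fn, hess_fn, alpha=1e-2):
        # Solve (H + alpha I) d = g; alpha damps directions of small or
        # negative curvature, as in Eq. 8.28.
        g = grad_fn(theta)
        H = hess_fn(theta)
        d = np.linalg.solve(H + alpha * np.eye(len(theta)), g)
        return theta - d

    # Illustration on a quadratic J(x) = 0.5 x^T A x, whose Hessian is A.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    x = regularized_newton_step(np.array([1.0, -1.0]),
                                grad_fn=lambda x: A @ x,
                                hess_fn=lambda x: A)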
8.6.2 Conjugate Gradients

Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions. The inspiration for this approach follows from a careful study of the weakness of the method of steepest descent (see Sec. 4.3 for details), where line searches are applied iteratively in the direction associated with the gradient. Fig. 8.6 illustrates how the method of steepest descent, when applied in a quadratic bowl, progresses in a rather ineffective back-and-forth, zig-zag pattern. This happens because each line search direction, when given by the gradient, is guaranteed to be orthogonal to the previous line search direction.
Let the previous search direction be d_{t−1}. At the minimum, where the line search terminates, the directional derivative is zero in direction d_{t−1}: ∇θ J(θ) · d_{t−1} = 0. Since the gradient at this point defines the current search direction, d_t = ∇θ J(θ) will have no contribution in the direction d_{t−1}. Thus d_t is orthogonal to d_{t−1}. This relationship between d_{t−1} and d_t is illustrated in Fig. 8.6 for multiple iterations of steepest descent. As demonstrated in the figure, the choice of orthogonal directions of descent do not preserve the minimum along the previous search directions. This gives rise to the zig-zag pattern of progress, where by descending to the minimum in the current gradient direction, we must re-minimize the objective in the previous gradient direction. Thus, by following the gradient at the end of each line search we are, in a sense, undoing progress we have already made in the direction of the previous line search. The method of conjugate gradients seeks to address this problem.

In the method of conjugate gradients, we seek to find a search direction that is conjugate to the previous line search direction, i.e., it will not undo progress made in that direction. At training iteration t, the next search direction d_t takes the form:

d_t = ∇θ J(θ) + β_t d_{t−1}    (8.29)
Figure 8.6: The method of steepest descent applied to a quadratic cost surface. The method of steepest descent involves jumping to the point of lowest cost along the line defined by the gradient at the initial point on each step. This resolves some of the problems seen with using a fixed learning rate in Fig. 4.6, but even with the optimal step size the algorithm still makes back-and-forth progress toward the optimum. By definition, at the minimum of the objective along a given direction, the gradient at the final point is orthogonal to that direction.
where β_t is a coefficient whose magnitude controls how much of the direction, d_{t−1}, we should add back to the current search direction.

Two directions, d_t and d_{t−1}, are defined as conjugate if

d_t⊤ H d_{t−1} = 0.    (8.30)

The straightforward way to impose conjugacy would involve calculation of the eigenvectors of H to choose β_t, which would not satisfy our goal of developing a method that is more computationally viable than Newton’s method for large problems. Can we calculate the conjugate directions without resorting to these calculations? Fortunately the answer to that is yes.

Two popular methods for computing the β_t are:

1. Fletcher-Reeves:

β_t = (∇θ J(θ_t)⊤ ∇θ J(θ_t)) / (∇θ J(θ_{t−1})⊤ ∇θ J(θ_{t−1}))    (8.31)

2. Polak-Ribière:

β_t = ((∇θ J(θ_t) − ∇θ J(θ_{t−1}))⊤ ∇θ J(θ_t)) / (∇θ J(θ_{t−1})⊤ ∇θ J(θ_{t−1}))    (8.32)
For a quadratic surface, the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. We therefore stay at the minimum along the previous directions. As a consequence, in a k-dimensional parameter space, conjugate gradients only requires k line searches to achieve the minimum. The conjugate gradient algorithm is given in Algorithm 8.9.
Algorithm 8.9 conjugate
Conjugategradient
gradienttalgorithm
gradien metho
method d is given in Algorithm 8.9.
Require: Initial parameters θ0
Algorithm 8.9 Conjugate gradien t method
Require: Training set of m examples
Require:
InitializeInitial
ρ0 = 0parameters θ
Require:
Initialize graining
T set of m examples
0 =0
Initialize tρ==1 0
Initialize
Initialize
while g = 0 criterion not met do
stopping
Initialize
Initializet= 1 gradien
the gradientt gt = 0 P
while stopping
Compute gradient: criterion
gt ← not1 ∇met doL(f (x (i) ; θ), y(i) )
m θ i
Initialize the gradient−1 t g) > =
gt 0
Compute βt = (gtg−g (P
(Polak-Ribière)
olak-Ribière)
L(f (x ; θ), y )
Compute gradient:t−1 g t−1
> g

(Nonlinear
Compute β = conjugate gradien
gradient:
← ∇ (Polak-Ribière)reset βt to zero, for example if t is
t: optionally
a multiple of some constant k, suc suchh as k = 5)
(Nonlinear conjugate gradien t: optionally
Compute search direction: ρt = −gt + βtρ t−1 reset β to zero, for example if t is
Pa erform
multiple of searc
line somehconstant
search to find: k∗, = suc
P h as k =1 5P
argmin ) m L(f (x (i); θ + ρ ), y(i))
 m i=1 t t
Compute
(On a trulysearch direction:
quadratic = g + analytically
costρ function, βρ solve for ∗ rather than
Perform line
explicitly search for
searching it)  =−argmin
to find: L(f (x ; θ + ρ ), y )
(On a
Apply up truly quadratic cost
date: θt+1 = θ t +  ∗ρ t
update: function, analytically solve for  rather than
texplicitly
← t + 1 searching for it)
endApply
whileupdate: θ = θ + ρ
t t+1 P
end←while
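To make the flow of Algorithm 8.9 concrete, here is a minimal NumPy sketch of conjugate gradients with the Polak-Ribière coefficient of Eq. 8.32 and the optional periodic reset. The objective f, its gradient grad_f, the reset period, and the crude backtracking line search are illustrative assumptions rather than part of the pseudocode above.

import numpy as np

def line_search(f, theta, rho, max_step=1.0, shrink=0.5, n_tries=20):
    """Crude backtracking search for eps ~ argmin_eps f(theta + eps * rho)."""
    eps, best_eps, best_val = max_step, 0.0, f(theta)
    for _ in range(n_tries):
        val = f(theta + eps * rho)
        if val < best_val:
            best_eps, best_val = eps, val
        eps *= shrink
    return best_eps

def conjugate_gradient(f, grad_f, theta0, n_iters=100, k=5):
    """Conjugate gradients with Polak-Ribiere beta (Eq. 8.32) and resets."""
    theta = theta0.copy()
    g_prev = np.zeros_like(theta)
    rho = np.zeros_like(theta)
    for t in range(1, n_iters + 1):
        g = grad_f(theta)
        if t == 1 or t % k == 0:       # reset: fall back to steepest descent
            beta = 0.0
        else:                          # Polak-Ribiere coefficient (Eq. 8.32)
            beta = (g - g_prev) @ g / (g_prev @ g_prev)
        rho = -g + beta * rho          # new search direction
        eps = line_search(f, theta, rho)
        theta = theta + eps * rho      # apply update
        g_prev = g
    return theta

# Usage on a quadratic bowl, where the conjugate directions are exact:
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
theta = conjugate_gradient(lambda x: 0.5 * x @ A @ x - b @ x,
                           lambda x: A @ x - b, np.zeros(2))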
Nonlinear Conjugate Gradients: So far we have discussed the method of conjugate gradients as it is applied to quadratic objective functions. Of course, our primary interest in this chapter is to explore optimization methods for training neural networks and other related deep learning models where the corresponding objective function is far from quadratic. Perhaps surprisingly, the method of conjugate gradients is still applicable in this setting, though with some modification. Without any assurance that the objective is quadratic, the conjugate directions are no longer assured to remain at the minimum of the objective for previous directions. As a result, the nonlinear conjugate gradients algorithm includes occasional resets where the method of conjugate gradients is restarted with line search along the unaltered gradient.

Practitioners report reasonable results in applications of the nonlinear conjugate gradients algorithm to training neural networks, though it is often beneficial to initialize the optimization with a few iterations of stochastic gradient descent before commencing nonlinear conjugate gradients. Also, while the (nonlinear) conjugate gradients algorithm has traditionally been cast as a batch method, minibatch versions have been used successfully for the training of neural networks (Le et al., 2011). Adaptations of conjugate gradients specifically for neural networks have been proposed earlier, such as the scaled conjugate gradients algorithm (Moller, 1993).
8.6.3 BFGS

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm attempts to bring some of the advantages of Newton's method without the computational burden. In that respect, BFGS is similar to CG. However, BFGS takes a more direct approach to the approximation of Newton's update. Recall that Newton's update is given by

$$\theta^* = \theta_0 - H^{-1}\nabla_\theta J(\theta_0), \qquad (8.33)$$
where H is the Hessian of J with respect to θ evaluated at θ_0. The primary computational difficulty in applying Newton's update is the calculation of the inverse Hessian H^{−1}. The approach adopted by quasi-Newton methods (of which the BFGS algorithm is the most prominent) is to approximate the inverse with a matrix M_t that is iteratively refined by low rank updates to become a better approximation of H^{−1}.

From Newton's update, in Eq. 8.33, we can see that the parameters at learning steps t are related via the secant condition (also known as the quasi-Newton condition):

$$\theta_{t+1} - \theta_t = H_t^{-1}\left(\nabla_\theta J(\theta_{t+1}) - \nabla_\theta J(\theta_t)\right) \qquad (8.34)$$

Eq. 8.34 holds precisely in the quadratic case, or approximately otherwise. The approximation to the Hessian inverse used in the BFGS procedure is constructed so as to satisfy this condition, with M in place of H^{−1}. Specifically, M is updated according to:

$$M_t = M_{t-1} + \left(1 + \frac{\varphi^\top M_{t-1}\varphi}{\Delta^\top\varphi}\right)\frac{\Delta\Delta^\top}{\Delta^\top\varphi} - \frac{\Delta\varphi^\top M_{t-1} + M_{t-1}\varphi\Delta^\top}{\Delta^\top\varphi}, \qquad (8.35)$$
where g_t = ∇_θ J(θ_t), φ = g_t − g_{t−1} and ∆ = θ_t − θ_{t−1}. Eq. 8.35 shows that the BFGS procedure iteratively refines the approximation of the inverse of the Hessian with rank one updates. This means that if θ ∈ ℝ^n, then the computational complexity of the update is O(n²). The derivation of the BFGS approximation is given in many textbooks on optimization, including Luenberger (1984).

Once the inverse Hessian approximation M_t is updated, the direction of descent ρ_t is determined by ρ_t = −M_t g_t. A line search is performed in this direction to determine the size of the step, ε*, taken in this direction. The final update to the parameters is given by:

$$\theta_{t+1} = \theta_t + \epsilon^*\rho_t. \qquad (8.36)$$

The complete BFGS algorithm is presented in Algorithm 8.10.

Algorithm 8.10 BFGS method

Require: Initial parameters θ_0
  Initialize inverse Hessian M_0 = I
  while stopping criterion not met do
    Compute gradient: g_t = ∇_θ J(θ_t)
    Compute φ = g_t − g_{t−1}, ∆ = θ_t − θ_{t−1}
    Approx H^{−1}: M_t = M_{t−1} + (1 + (φ^⊤M_{t−1}φ)/(∆^⊤φ)) (∆∆^⊤)/(∆^⊤φ) − (∆φ^⊤M_{t−1} + M_{t−1}φ∆^⊤)/(∆^⊤φ)
    Compute search direction: ρ_t = −M_t g_t
    Perform line search to find: ε* = argmin_ε J(θ_t + ερ_t)
    Apply update: θ_{t+1} = θ_t + ε*ρ_t
  end while
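The following NumPy sketch mirrors Algorithm 8.10 under illustrative assumptions: f and grad_f are supplied by the caller, the backtracking line_search from the conjugate gradient sketch above is reused, and the first iteration behaves like a plain gradient step because M starts at the identity.

import numpy as np

def bfgs(f, grad_f, theta0, n_iters=100):
    """BFGS with an explicit inverse Hessian approximation M (Eq. 8.35)."""
    theta = theta0.copy()
    M = np.eye(theta.size)                 # M_0 = I
    g = grad_f(theta)
    for _ in range(n_iters):
        rho = -M @ g                       # search direction
        eps = line_search(f, theta, rho)   # reuse the sketch above
        theta_new = theta + eps * rho
        g_new = grad_f(theta_new)
        phi = g_new - g                    # change in the gradient
        delta = theta_new - theta          # change in the parameters
        s = delta @ phi
        if abs(s) > 1e-12:                 # skip degenerate updates
            Mphi = M @ phi
            # Low rank refinement of the inverse Hessian (Eq. 8.35)
            M = (M
                 + (1.0 + phi @ Mphi / s) * np.outer(delta, delta) / s
                 - (np.outer(delta, Mphi) + np.outer(Mphi, delta)) / s)
        theta, g = theta_new, g_new
    return theta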
Like the method of conjugate gradients, the BFGS algorithm iterates a series of line searches with the direction incorporating second-order information. However, unlike conjugate gradients, the success of the approach is not heavily dependent on the line search finding a point very close to the true minimum along the line. Thus, relative to conjugate gradients, BFGS has the advantage that it can spend less time refining each line search. On the other hand, the BFGS algorithm must store the inverse Hessian matrix, M, that requires O(n²) memory, making BFGS impractical for most modern deep learning models that typically have millions of parameters.
Limited Memory BFGS (or L-BFGS) The memory costs of the BFGS algorithm can be significantly decreased by avoiding storing the complete inverse Hessian approximation M. Alternatively, by replacing the M_{t−1} in Eq. 8.35 with an identity matrix, the BFGS search direction update formula becomes:

$$\rho_t = -g_t + a\Delta + b\varphi, \qquad (8.37)$$
where the scalars a and b are given by:

$$a = -\left(1 + \frac{\varphi^\top\varphi}{\Delta^\top\varphi}\right)\frac{\Delta^\top g_t}{\Delta^\top\varphi} + \frac{\varphi^\top g_t}{\Delta^\top\varphi} \qquad (8.38)$$

$$b = \frac{\Delta^\top g_t}{\Delta^\top\varphi} \qquad (8.39)$$

with φ and ∆ as defined above. If used with exact line searches, the directions defined by Eq. 8.37 are mutually conjugate. However, unlike the method of conjugate gradients, this procedure remains well behaved when the minimum of the line search is reached only approximately. This strategy can be generalized to include more information about the Hessian by storing previous values of φ and ∆.
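In code, the memory-free direction of Eqs. 8.37-8.39 needs only the most recent pair (φ, ∆) and no stored matrix. The function below is a sketch of exactly these three equations; its name and calling convention are assumptions for illustration.

import numpy as np

def lbfgs_direction(g, delta, phi):
    """Search direction of Eq. 8.37, from Eq. 8.35 with M_{t-1} = I."""
    s = delta @ phi                                                # Delta^T phi
    b = (delta @ g) / s                                            # Eq. 8.39
    a = -(1.0 + phi @ phi / s) * (delta @ g) / s + (phi @ g) / s   # Eq. 8.38
    return -g + a * delta + b * phi                                # Eq. 8.37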
8.7 Optimization Strategies and Meta-Algorithms

Many optimization techniques are not exactly algorithms, but rather general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.
8.7.1 Batch Normalization

Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent innovations in optimizing deep neural networks and it is actually not an optimization algorithm at all. Instead, it is a method of adaptive reparametrization, motivated by the difficulty of training very deep models.

Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously. When we make the update, unexpected results can happen because many functions composed together are changed simultaneously, using updates that were computed under the assumption that the other functions remain constant. As a simple example, suppose we have a deep neural network that has only one unit per layer and does not use an activation function at each hidden layer: ŷ = x w_1 w_2 w_3 ... w_l. Here, w_i provides the weight used by layer i. The output of layer i is h_i = h_{i−1} w_i. The output ŷ is a linear function of the input x, but a nonlinear function of the weights w_i. Suppose our cost function has put a gradient of 1 on ŷ, so we wish to decrease ŷ slightly. The back-propagation algorithm can then compute a gradient g = ∇_w ŷ. Consider what happens when we make an update w ← w − εg. The
first-order Taylor series approximation of ŷ predicts that the value of ŷ will decrease by εg^⊤g. If we wanted to decrease ŷ by 0.1, this first-order information available in the gradient suggests we could set the learning rate ε to 0.1/(g^⊤g). However, the actual update will include second-order and third-order effects, on up to effects of order l. The new value of ŷ is given by

$$x(w_1 - \epsilon g_1)(w_2 - \epsilon g_2)\cdots(w_l - \epsilon g_l). \qquad (8.40)$$

An example of one second-order term arising from this update is ε² g_1 g_2 ∏_{i=3}^{l} w_i. This term might be negligible if ∏_{i=3}^{l} w_i is small, or might be exponentially large if the weights on layers 3 through l are greater than 1. This makes it very hard to choose an appropriate learning rate, because the effects of an update to the parameters for one layer depend so strongly on all of the other layers. Second-order optimization algorithms address this issue by computing an update that takes these second-order interactions into account, but we can see that in very deep networks, even higher-order interactions can be significant. Even second-order optimization algorithms are expensive and usually require numerous approximations that prevent them from truly accounting for all significant second-order interactions. Building an n-th order optimization algorithm for n > 2 thus seems hopeless. What can we do instead?
Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers. Batch normalization can be applied to any input or hidden layer in a network. Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix, with the activations for each example appearing in a row of the matrix. To normalize H, we replace it with

$$H' = \frac{H - \mu}{\sigma}, \qquad (8.41)$$

where µ is a vector containing the mean of each unit and σ is a vector containing the standard deviation of each unit. The arithmetic here is based on broadcasting the vector µ and the vector σ to be applied to every row of the matrix H. Within each row, the arithmetic is element-wise, so H_{i,j} is normalized by subtracting µ_j and dividing by σ_j. The rest of the network then operates on H' in exactly the same way that the original network operated on H.

At training time,

$$\mu = \frac{1}{m}\sum_i H_{i,:} \qquad (8.42)$$

and

$$\sigma = \sqrt{\delta + \frac{1}{m}\sum_i (H - \mu)^2_i}, \qquad (8.43)$$
where δ is a small positive value such as 10^{−8} imposed to avoid encountering the undefined gradient of √z at z = 0. Crucially, we back-propagate through these operations for computing the mean and the standard deviation, and for applying them to normalize H. This means that the gradient will never propose an operation that acts simply to increase the standard deviation or mean of h_i; the normalization operations remove the effect of such an action and zero out its component in the gradient. This was a major innovation of the batch normalization approach. Previous approaches had involved adding penalties to the cost function to encourage units to have normalized activation statistics or involved intervening to renormalize unit statistics after each gradient descent step. The former approach usually resulted in imperfect normalization and the latter usually resulted in significant wasted time as the learning algorithm repeatedly proposed changing the mean and variance and the normalization step repeatedly undid this change. Batch normalization reparametrizes the model to make some units always be standardized by definition, deftly sidestepping both problems.

At test time, µ and σ may be replaced by running averages that were collected during training time. This allows the model to be evaluated on a single example, without needing to use definitions of µ and σ that depend on an entire minibatch.

Revisiting the ŷ = x w_1 w_2 ... w_l example, we see that we can mostly resolve the difficulties in learning this model by normalizing h_{l−1}. Suppose that x is drawn from a unit Gaussian. Then h_{l−1} will also come from a Gaussian, because the transformation from x to h_{l−1} is linear. However, h_{l−1} will no longer have zero mean and unit variance. After applying batch normalization, we obtain the normalized ĥ_{l−1} that restores the zero mean and unit variance properties. For almost any update to the lower layers, ĥ_{l−1} will remain a unit Gaussian. The output ŷ may then be learned as a simple linear function ŷ = w_l ĥ_{l−1}. Learning in this model is now very simple because the parameters at the lower layers simply do not have any effect in most cases; their output is always renormalized to a unit Gaussian. In some corner cases, the lower layers can have an effect. Changing one of the lower layer weights to 0 can make the output become degenerate, and changing the sign of one of the lower weights can flip the relationship between ĥ_{l−1} and y. These situations are very rare. Without normalization, nearly every update would have an extreme effect on the statistics of h_{l−1}. Batch normalization has thus made this model significantly easier to learn. In this example, the ease of learning of course came at the cost of making the lower layers useless. In our linear example,
the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect. This is because we have normalized out the first and second order statistics, which is all that a linear network can influence. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful. Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change.

Because the final layer of the network is able to learn a linear transformation, we may actually wish to remove all linear relationships between units within a layer. Indeed, this is the approach taken by Desjardins et al. (2015), who provided the inspiration for batch normalization. Unfortunately, eliminating all linear interactions is much more expensive than standardizing the mean and standard deviation of each individual unit, and so far batch normalization remains the most practical approach.

Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH' + β rather than simply the normalized H'. The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless: why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH' + β is determined solely by β. The new parametrization is much easier to learn with gradient descent.
Most neural network layers take the form of φ(XW + b) where φ is some fixed nonlinear activation function such as the rectified linear transformation. It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW + b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW + b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparametrization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.

In convolutional networks, described in Chapter 9, it is important to apply the same normalizing µ and σ at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.
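Operationally, the pieces of this section fit together as in the NumPy sketch below: the statistics of Eqs. 8.42-8.43, the normalization of Eq. 8.41, the learned scale and shift γ and β, running averages for test time, and shared per-feature-map statistics for convolutional networks. The exponential running-average rule with its momentum constant, the (batch, height, width, channels) layout, and all names here are common-practice assumptions rather than details fixed by the text.

import numpy as np

def batch_norm(H, gamma, beta, stats, train=True, momentum=0.9, delta=1e-8):
    """Batch-normalize a design matrix H (rows = examples, columns = units)
    and return gamma * H' + beta with H' = (H - mu) / sigma (Eq. 8.41)."""
    if train:
        mu = H.mean(axis=0)                                    # Eq. 8.42
        sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # Eq. 8.43
        # Running averages collected for test time (assumed update rule).
        stats["mu"] = momentum * stats["mu"] + (1 - momentum) * mu
        stats["sigma"] = momentum * stats["sigma"] + (1 - momentum) * sigma
    else:
        # Test time: stored statistics allow evaluating a single example.
        mu, sigma = stats["mu"], stats["sigma"]
    H_prime = (H - mu) / sigma            # broadcasts over the rows of H
    return gamma * H_prime + beta         # learned scale and shift

def conv_batch_norm_stats(H, delta=1e-8):
    """For feature maps stored as (batch, height, width, channels), share one
    mu and sigma per feature map by reducing over batch and both spatial axes."""
    mu = H.mean(axis=(0, 1, 2))
    sigma = np.sqrt(delta + H.var(axis=(0, 1, 2)))
    return mu, sigma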
8.7.2 Coordinate Descent

In some cases, it may be possible to solve an optimization problem quickly by breaking it into separate pieces. If we minimize f(x) with respect to a single variable x_i, then minimize it with respect to another variable x_j and so on, repeatedly cycling through all variables, we are guaranteed to arrive at a (local) minimum. This practice is known as coordinate descent, because we optimize one coordinate at a time. More generally, block coordinate descent refers to minimizing with respect to a subset of the variables simultaneously. The term "coordinate descent" is often used to refer to block coordinate descent as well as the strictly individual coordinate descent.

Coordinate descent makes the most sense when the different variables in the optimization problem can be clearly separated into groups that play relatively isolated roles, or when optimization with respect to one group of variables is significantly more efficient than optimization with respect to all of the variables. For example, consider the cost function

$$J(H, W) = \sum_{i,j} |H_{i,j}| + \sum_{i,j} \left(X - W^\top H\right)^2_{i,j}. \qquad (8.44)$$

This function describes a learning problem called sparse coding, where the goal is to find a weight matrix W that can linearly decode a matrix of activation values H to reconstruct the training set X. Most applications of sparse coding also involve weight decay or a constraint on the norms of the columns of W, in order to prevent the pathological solution with extremely small H and large W.

The function J is not convex. However, we can divide the inputs to the training algorithm into two sets: the dictionary parameters W and the code representations H. Minimizing the objective function with respect to either one of these sets of variables is a convex problem. Block coordinate descent thus gives us an optimization strategy that allows us to use efficient convex optimization algorithms, by alternating between optimizing W with H fixed, then optimizing H with W fixed.

Coordinate descent is not a very good strategy when the value of one variable strongly influences the optimal value of another variable, as in the function
f(x) = (x_1 − x_2)² + α(x_1² + x_2²), where α is a positive constant. The first term encourages the two variables to have similar value, while the second term encourages them to be near zero. The solution is to set both to zero. Newton's method can solve the problem in a single step because it is a positive definite quadratic problem. However, for small α, coordinate descent will make very slow progress because the first term does not allow a single variable to be changed to a value that differs significantly from the current value of the other variable.
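Returning to the sparse coding objective of Eq. 8.44, the block coordinate descent loop can be sketched as below. With W fixed, the H-step is a lasso problem, approximated here with proximal gradient (ISTA) iterations; with H fixed, the W-step is ordinary least squares in closed form. The subsolvers, iteration counts, and initialization are illustrative choices, not prescriptions from the text.

import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of t * |Z|_1 (elementwise shrinkage)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def solve_codes(X, W, H, n_steps=100):
    """Approximately minimize sum|H| + ||X - W^T H||^2 over H (a lasso)."""
    A = W.T
    eta = 1.0 / (2 * np.linalg.norm(A.T @ A, 2) + 1e-12)  # safe step size
    for _ in range(n_steps):
        H = soft_threshold(H - eta * 2 * A.T @ (A @ H - X), eta)
    return H

def solve_dictionary(X, H):
    """Minimize ||X - W^T H||^2 over W: least squares in closed form."""
    return (X @ np.linalg.pinv(H)).T

def sparse_coding_bcd(X, k, n_iters=20, seed=0):
    """Block coordinate descent for Eq. 8.44, alternating convex subproblems."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, X.shape[0]))
    H = np.zeros((k, X.shape[1]))
    for _ in range(n_iters):
        H = solve_codes(X, W, H)     # H-step: convex with W fixed
        W = solve_dictionary(X, H)   # W-step: convex with H fixed
    return H, W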
8.7.3 Polyak Averaging

Polyak averaging (Polyak and Juditsky, 1992) consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm. If t iterations of gradient descent visit points θ^(1), ..., θ^(t), then the output of the Polyak averaging algorithm is θ̂^(t) = (1/t) Σ_i θ^(i). On some problem classes, such as gradient descent applied to convex problems, this approach has strong convergence guarantees. When applied to neural networks, its justification is more heuristic, but it performs well in practice. The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near the bottom of the valley. The average of all of the locations on either side should be close to the bottom of the valley though.

In non-convex problems, the path taken by the optimization trajectory can be very complicated and visit many different regions. Including points in parameter space from the distant past that may be separated from the current point by large barriers in the cost function does not seem like a useful behavior. As a result, when applying Polyak averaging to non-convex problems, it is typical to use an exponentially decaying running average:

$$\hat\theta^{(t)} = \alpha\hat\theta^{(t-1)} + (1 - \alpha)\theta^{(t)}. \qquad (8.45)$$

The running average approach is used in numerous applications. See Szegedy et al. (2015) for a recent example.
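In code, the running average of Eq. 8.45 is a single line wrapped around any optimizer step; the decay constant below is an arbitrary illustrative value.

def polyak_average(theta_avg, theta, alpha=0.999):
    """Exponentially decaying running average of the parameters (Eq. 8.45)."""
    return alpha * theta_avg + (1 - alpha) * theta

# Inside a training loop (sgd_step stands in for any update rule):
#     theta = sgd_step(theta, gradient)
#     theta_avg = polyak_average(theta_avg, theta)
# Evaluation and deployment then use theta_avg rather than theta.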
8.7.4 Supervised Pretraining

Sometimes, directly training a model to solve a specific task can be too ambitious if the model is complex and hard to optimize or if the task is very difficult. It is sometimes more effective to train a simpler model to solve the task, then make the model more complex. It can also be more effective to train the model to solve a simpler task, then move on to confront the final task. These strategies that involve training simple models on simple tasks before confronting the challenge of training the desired model to perform the desired task are collectively known as pretraining.

Greedy algorithms break a problem into many components, then solve for the optimal version of each component in isolation. Unfortunately, combining the individually optimal components is not guaranteed to yield an optimal complete solution. However, greedy algorithms can be computationally much cheaper than algorithms that solve for the best joint solution, and the quality of a greedy solution is often acceptable if not optimal. Greedy algorithms may also be followed by a fine-tuning stage in which a joint optimization algorithm searches for an optimal solution to the full problem. Initializing the joint optimization algorithm with a greedy solution can greatly speed it up and improve the quality of the solution it finds.

Pretraining, and especially greedy pretraining, algorithms are ubiquitous in deep learning. In this section, we describe specifically those pretraining algorithms that break supervised learning problems into other simpler supervised learning problems. This approach is known as greedy supervised pretraining.

In the original (Bengio et al., 2007) version of greedy supervised pretraining, each stage consists of a supervised learning training task involving only a subset of the layers in the final neural network. An example of greedy supervised pretraining is illustrated in Fig. 8.7, in which each added hidden layer is pretrained as part of a shallow supervised MLP, taking as input the output of the previously trained hidden layer. Instead of pretraining one layer at a time, Simonyan and Zisserman (2015) pretrain a deep convolutional network (eleven weight layers) and then use the first four and last three layers from this network to initialize even deeper networks (with up to nineteen layers of weights). The middle layers of the new, very deep network are initialized randomly. The new network is then jointly trained. Another option, explored by Yu et al. (2010), is to use the outputs of the previously trained MLPs, as well as the raw input, as inputs for each added stage.

Why would greedy supervised pretraining help? The hypothesis initially discussed by Bengio et al. (2007) is that it helps to provide better guidance to the intermediate levels of a deep hierarchy. In general, pretraining may help both in terms of optimization and in terms of generalization.

An approach related to supervised pretraining extends the idea to the context of transfer learning: Yosinski et al. (2014) pretrain a deep convolutional net with 8 layers of weights on a set of tasks (a subset of the 1000 ImageNet object categories) and then initialize a same-size network with the first k layers of the first net. All the layers of the second network (with the upper layers initialized randomly) are
[Figure 8.7: four network diagrams, panels (a)-(d), showing hidden layers h(1) and h(2), input-to-hidden weights W(1) and W(2), output weights U(1) and U(2), input x and output y.]

Figure 8.7: Illustration of one form of greedy supervised pretraining (Bengio et al., 2007). (a) We start by training a sufficiently shallow architecture. (b) Another drawing of the same architecture. (c) We keep only the input-to-hidden layer of the original network and discard the hidden-to-output layer. We send the output of the first hidden layer as input to another supervised single hidden layer MLP that is trained with the same objective as the first network was, thus adding a second hidden layer. This can be repeated for as many layers as desired. (d) Another drawing of the result, viewed as a feedforward network. To further improve the optimization, we can jointly fine-tune all the layers, either only at the end or at each stage of this process.

then jointly trained to perform a different set of tasks (another subset of the 1000
ImageNet ob object
ject categories), with fewer training examples than for the first set of
then jointly trained
tasks. Other approaches to perform
to transfer a different
learning setwith
of tasks
neural (another
netw
networks subset
orks of the 1000
are discussed in
ImageNet
Sec. 15.2. ob ject categories), with fewer training examples than for the first set of
tasks. Other approaches to transfer learning with neural networks are discussed in
Another related line of work is the FitNets (Romero et al., 2015) approac approach. h. This
Sec. 15.2.
approach begins by training a network that has low enough depth and great enough
width (number of units per layer) to be easy to train. This network then becomes
a teacher for a second network, designated the student. The student network is
much deeper and thinner (eleven to nineteen layers) and would be difficult to train
with SGD under normal circumstances. The training of the student network is
made easier by training the student network not only to predict the output for
the original task, but also to predict the value of the middle layer of the teacher
network. This extra task provides a set of hints about how the hidden layers
should be used and can simplify the optimization problem. Additional parameters
are introduced to regress the middle layer of the 5-layer teacher network from
the middle layer of the deeper student network. However, instead of predicting
the final classification target, the objective is to predict the middle hidden layer
of the teacher network. The lower layers of the student networks thus have two
objectives: to help the outputs of the student network accomplish their task, as
well as to predict the intermediate layer of the teacher network. Although a thin
and deep network appears to be more difficult to train than a wide and shallow
network, the thin and deep network may generalize better and certainly has lower
computational cost if it is thin enough to have far fewer parameters. Without
the hints on the hidden layer, the student network performs very poorly in the
experiments, both on the training and test set. Hints on middle layers may thus
be one of the tools to help train neural networks that otherwise seem difficult to
train, but other optimization techniques or changes in the architecture may also
solve the problem.
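To make the hint objective concrete, here is a minimal sketch in plain NumPy. It is not the implementation of Romero et al. (2015); the layer widths, the regressor, and the weighting coefficient `lam` are hypothetical choices for illustration only.

```python
import numpy as np

# A minimal sketch of a FitNets-style objective (not the authors' code).
# h_teacher and h_student stand in for the middle-layer activations of the
# teacher and student networks; W_r is the extra regressor that maps the
# thin student layer up to the teacher's width.
rng = np.random.default_rng(0)
h_teacher = rng.normal(size=400)         # wide teacher middle layer
h_student = rng.normal(size=50)          # thin student middle layer
W_r = rng.normal(size=(400, 50)) * 0.1   # hypothetical regressor weights

def hint_loss(h_s, h_t, W):
    # Squared error between the teacher's middle layer and the regressed
    # student layer: the "hint" that guides the student's lower layers.
    diff = W @ h_s - h_t
    return 0.5 * float(diff @ diff)

def student_loss(task_loss, h_s, h_t, W, lam=1.0):
    # The student minimizes its ordinary task loss plus the hint term.
    return task_loss + lam * hint_loss(h_s, h_t, W)

print(student_loss(task_loss=2.3, h_s=h_student, h_t=h_teacher, W=W_r))
```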
8.7.5 Designing Models to Aid Optimization

To improve optimization, the best strategy is not always to improve the optimization
algorithm. Instead, many improvements in the optimization of deep models have
come from designing the models to be easier to optimize.

In principle, we could use activation functions that increase and decrease in
jagged non-monotonic patterns. However, this would make optimization extremely
difficult. In practice, it is more important to choose a model family that is easy
to optimize than to use a powerful optimization algorithm. Most of the advances
in neural network learning over the past 30 years have been obtained by changing
the model family rather than changing the optimization procedure. Stochastic
gradient descent with momentum, which was used to train neural networks in the
1980s, remains in use in modern state of the art neural network applications.

Specifically, modern neural networks reflect a design choice to use linear
transformations between layers and activation functions that are differentiable
almost everywhere and have significant slope in large portions of their domain. In
particular, model innovations like the LSTM, rectified linear units and maxout
units have all moved toward using more linear functions than previous models like
deep networks based on sigmoidal units. These models have nice properties that
make optimization easier. The gradient flows through many layers provided that
the Jacobian of the linear transformation has reasonable singular values. Moreover,
linear functions consistently increase in a single direction, so even if the model's
output is very far from correct, it is clear simply from computing the gradient
which direction its output should move to reduce the loss function. In other words,
modern neural nets have been designed so that their local gradient information
corresponds reasonably well to moving toward a distant solution.

Other model design strategies can help to make optimization easier. For
example, linear paths or skip connections between layers reduce the length of
the shortest path from the lower layer's parameters to the output, and thus
mitigate the vanishing gradient problem (Srivastava et al., 2015). A related idea
to skip connections is adding extra copies of the output that are attached to the
intermediate hidden layers of the network, as in GoogLeNet (Szegedy et al., 2014a)
and deeply-supervised nets (Lee et al., 2014). These "auxiliary heads" are trained
to perform the same task as the primary output at the top of the network in order
to ensure that the lower layers receive a large gradient. When training is complete
the auxiliary heads may be discarded. This is an alternative to the pretraining
strategies, which were introduced in the previous section. In this way, one can
train jointly all the layers in a single phase but change the architecture, so that
intermediate layers (especially the lower ones) can get some hints about what they
should do, via a shorter path. These hints provide an error signal to lower layers.
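As a minimal illustration of the auxiliary-head idea (a sketch with hypothetical sizes and loss weighting, not GoogLeNet's actual configuration), the extra heads simply contribute additional loss terms computed from intermediate layers:

```python
import numpy as np

def softmax_xent(logits, label):
    # Cross-entropy of a single example, computed stably.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
x = rng.normal(size=64)
W1, W2 = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
head_mid, head_out = rng.normal(size=(10, 64)), rng.normal(size=(10, 64))
label = 3

h1 = np.maximum(0.0, W1 @ x)      # lower hidden layer
h2 = np.maximum(0.0, W2 @ h1)     # upper hidden layer

main_loss = softmax_xent(head_out @ h2, label)
aux_loss = softmax_xent(head_mid @ h1, label)  # auxiliary head on h1

# The auxiliary term injects gradient directly into the lower layer's
# parameters, shortening the path the error signal must travel. At test
# time the auxiliary head is discarded.
total = main_loss + 0.3 * aux_loss
print(total)
```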
8.7.6 Continuation Methods and Curriculum Learning

As argued in Sec. 8.2.7, many of the challenges in optimization arise from the global
structure of the cost function and cannot be resolved merely by making better
estimates of local update directions. The predominant strategy for overcoming this
problem is to attempt to initialize the parameters in a region that is connected
to the solution by a short path through parameter space that local descent can
discover.

Continuation methods are a family of strategies that can make optimization
easier by choosing initial points to ensure that local optimization spends most of
its time in well-behaved regions of space. The idea behind continuation methods is
to construct a series of objective functions over the same parameters. In order to
minimize a cost function J(θ), we will construct new cost functions {J^(0), …, J^(n)}.
These cost functions are designed to be increasingly difficult, with J^(0) being fairly
easy to minimize, and J^(n), the most difficult, being J(θ), the true cost function
motivating the entire process. When we say that J^(i) is easier than J^(i+1), we
mean that it is well behaved over more of θ space. A random initialization is more
likely to land in the region where local descent can minimize the cost function
successfully because this region is larger. The series of cost functions are designed
so that a solution to one is a good initial point of the next. We thus begin by
solving an easy problem then refine the solution to solve incrementally harder
problems until we arrive at a solution to the true underlying problem.

Traditional continuation methods (predating the use of continuation methods
for neural network training) are usually based on smoothing the objective function.
See Wu (1997) for an example of such a method and a review of some related
methods. Continuation methods are also closely related to simulated annealing,
which adds noise to the parameters (Kirkpatrick et al., 1983). Continuation
methods have been extremely successful in recent years. See Mobahi and Fisher
(2015) for an overview of recent literature, especially for AI applications.

Continuation methods traditionally were mostly designed with the goal of
overcoming the challenge of local minima. Specifically, they were designed to
reach a global minimum despite the presence of many local minima. To do so,
these continuation methods would construct easier cost functions by "blurring" the
original cost function. This blurring operation can be done by approximating

J^(i)(θ) = E_{θ′ ∼ N(θ′; θ, σ^(i)²)} J(θ′)    (8.46)

via sampling. The intuition for this approach is that some non-convex functions
become approximately convex when blurred. In many cases, this blurring preserves
enough information about the location of a global minimum that we can find the
global minimum by solving progressively less blurred versions of the problem.
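The sampling approximation of Eq. 8.46 is straightforward to write down. Here is a minimal sketch (NumPy, with a toy one-dimensional cost function chosen purely for illustration):

```python
import numpy as np

def J(theta):
    # A toy non-convex cost function, used only for illustration.
    return np.sin(5.0 * theta) + 0.5 * theta ** 2

def blurred_J(theta, sigma, n_samples=10_000, rng=np.random.default_rng(0)):
    # Monte Carlo estimate of Eq. 8.46: average J over parameters drawn
    # from a Gaussian centered at theta with standard deviation sigma.
    samples = rng.normal(loc=theta, scale=sigma, size=n_samples)
    return J(samples).mean()

# Larger sigma gives a smoother (easier) member of the series;
# sigma -> 0 recovers the true cost, so J^(n) = J.
for sigma in (1.0, 0.3):
    print(sigma, blurred_J(0.5, sigma))
print(0.0, J(0.5))
```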
This approach can break down in three different ways. First, it might successfully
define a series of cost functions where the first is convex and the optimum tracks
from one function to the next arriving at the global minimum, but it might require
so many incremental cost functions that the cost of the entire procedure remains
high. NP-hard optimization problems remain NP-hard, even when continuation
methods are applicable. The other two ways that continuation methods fail both
correspond to the method not being applicable. First, the function might not
become convex, no matter how much it is blurred. Consider for example the
function J(θ) = −θ⊤θ. Second, the function may become convex as a result of
blurring, but the minimum of this blurred function may track to a local rather
than a global minimum of the original cost function.

Though continuation methods were mostly originally designed to deal with the
problem of local minima, local minima are no longer believed to be the primary
problem for neural network optimization. Fortunately, continuation methods can
still help. The easier objective functions introduced by the continuation method can
eliminate flat regions, decrease variance in gradient estimates, improve conditioning
of the Hessian matrix, or do anything else that will either make local updates
easier to compute or improve the correspondence between local update directions
and progress toward a global solution.

Bengio et al. (2009) observed that an approach called curriculum learning or
shaping can be interpreted as a continuation method. Curriculum learning is based
on the idea of planning a learning process to begin by learning simple concepts
and progress to learning more complex concepts that depend on these simpler
concepts. This basic strategy was previously known to accelerate progress in animal
training (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) and machine
learning (Solomonoff, 1989; Elman, 1993; Sanger, 1994). Bengio et al. (2009)
justified this strategy as a continuation method, where earlier J^(i) are made easier
by increasing the influence of simpler examples (either by assigning their contributions
to the cost function larger coefficients, or by sampling them more frequently), and
experimentally demonstrated that better results could be obtained by following a
curriculum on a large-scale neural language modeling task. Curriculum learning
has been successful on a wide range of natural language (Spitkovsky et al., 2010;
Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer
vision (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013)
tasks. Curriculum learning was also verified as being consistent with the way in
which humans teach (Khan et al., 2011): teachers start by showing easier and
more prototypical examples and then help the learner refine the decision surface
with the less obvious cases. Curriculum-based strategies are more effective for
teaching humans than strategies based on uniform sampling of examples, and can
also increase the effectiveness of other teaching strategies (Basu and Christensen,
2013).

Another important contribution to research on curriculum learning arose in the
context of training recurrent neural networks to capture long-term dependencies:
Zaremba and Sutskever (2014) found that much better results were obtained with
a stochastic curriculum, in which a random mix of easy and difficult examples is
always presented to the learner, but where the average proportion of the more
difficult examples (here, those with longer-term dependencies) is gradually increased.
With a deterministic curriculum, no improvement over the baseline (ordinary
training from the full training set) was observed.
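A minimal sketch of such a stochastic curriculum (the data split and the schedule are hypothetical): every batch remains a random mix, but the probability of drawing a hard example grows over training.

```python
import numpy as np

rng = np.random.default_rng(0)
easy_examples = list(range(0, 100))      # hypothetical easy pool
hard_examples = list(range(100, 200))    # hypothetical hard pool

def sample_batch(step, total_steps, batch_size=8):
    # Stochastic curriculum: the average proportion of difficult examples
    # increases linearly with training progress, but every batch is still
    # a random mix of easy and hard examples.
    p_hard = 0.1 + 0.8 * (step / total_steps)
    batch = []
    for _ in range(batch_size):
        pool = hard_examples if rng.random() < p_hard else easy_examples
        batch.append(pool[rng.integers(len(pool))])
    return batch

print(sample_batch(step=0, total_steps=1000))     # mostly easy
print(sample_batch(step=900, total_steps=1000))   # mostly hard
```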
We have now described the basic family of neural network models and how to
regularize and optimize them. In the chapters ahead, we turn to specializations of
the neural network family, that allow neural networks to scale to very large sizes
and process input data that has special structure. The optimization methods
discussed in this chapter are often directly applicable to these specialized
architectures with little or no modification.

Chapter 9

Convolutional Networks
Convolutional networks (LeCun, 1989), also known as convolutional neural networks
or CNNs, are a specialized kind of neural network for processing data that has
a known, grid-like topology. Examples include time-series data, which can be
thought of as a 1D grid taking samples at regular time intervals, and image data,
which can be thought of as a 2D grid of pixels. Convolutional networks have been
tremendously successful in practical applications. The name "convolutional neural
network" indicates that the network employs a mathematical operation called
convolution. Convolution is a specialized kind of linear operation. Convolutional
networks are simply neural networks that use convolution in place of general
matrix multiplication in at least one of their layers.

In this chapter, we will first describe what convolution is. Next, we will
explain the motivation behind using convolution in a neural network. We will
then describe an operation called pooling, which almost all convolutional networks
employ. Usually, the operation used in a convolutional neural network does not
correspond precisely to the definition of convolution as used in other fields such
as engineering or pure mathematics. We will describe several variants on the
convolution function that are widely used in practice for neural networks. We
will also show how convolution may be applied to many kinds of data, with
different numbers of dimensions. We then discuss means of making convolution
more efficient. Convolutional networks stand out as an example of neuroscientific
principles influencing deep learning. We will discuss these neuroscientific principles,
then conclude with comments about the role convolutional networks have played
in the history of deep learning. One topic this chapter does not address is how to
choose the architecture of your convolutional network. The goal of this chapter is
to describe the kinds of tools that convolutional networks provide, while Chapter 11
describes general guidelines for choosing which tools to use in which circumstances.
Research into convolutional network architectures proceeds so rapidly that a new
best architecture for a given benchmark is announced every few weeks to months,
rendering it impractical to describe the best architecture in print. However, the
best architectures have consistently been composed of the building blocks described
here.

9.1 The Convolution Operation

In its most general form, convolution is an operation on two functions of a real-
valued argument. To motivate the definition of convolution, we start with examples
of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our
laser sensor provides a single output x(t), the position of the spaceship at time
t. Both x and t are real-valued, i.e., we can get a different reading from the laser
sensor at any instant in time.

Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy
estimate of the spaceship's position, we would like to average together several
measurements. Of course, more recent measurements are more relevant, so we will
want this to be a weighted average that gives more weight to recent measurements.
We can do this with a weighting function w(a), where a is the age of a measurement.
If we apply such a weighted average operation at every moment, we obtain a new
function s providing a smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da    (9.1)

This operation is called convolution. The convolution operation is typically
denoted with an asterisk:

s(t) = (x ∗ w)(t)    (9.2)

In our example, w needs to be a valid probability density function, or the
output is not a weighted average. Also, w needs to be 0 for all negative arguments,
or it will look into the future, which is presumably beyond our capabilities. These
limitations are particular to our example though. In general, convolution is defined
for any functions for which the above integral is defined, and may be used for other
purposes besides taking weighted averages.
In convolutional network terminology, the first argument (in this example, the
function x) to the convolution is often referred to as the input and the second
argument (in this example, the function w) as the kernel. The output is sometimes
referred to as the feature map.

In our example, the idea of a laser sensor that can provide measurements
at every instant in time is not realistic. Usually, when we work with data on a
computer, time will be discretized, and our sensor will provide data at regular
intervals. In our example, it might be more realistic to assume that our laser
provides a measurement once per second. The time index t can then take on only
integer values. If we now assume that x and w are defined only on integer t, we
can define the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a)    (9.3)

In machine learning applications, the input is usually a multidimensional array
of data and the kernel is usually a multidimensional array of parameters that are
adapted by the learning algorithm. We will refer to these multidimensional arrays
as tensors. Because each element of the input and kernel must be explicitly stored
separately, we usually assume that these functions are zero everywhere but the
finite set of points for which we store the values. This means that in practice we
can implement the infinite summation as a summation over a finite number of
array elements.
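As a concrete sketch (plain NumPy; the signal and kernel values are made up for illustration), the finite version of Eq. 9.3 can be computed directly, and matches NumPy's built-in routine:

```python
import numpy as np

def conv1d(x, w):
    # Discrete convolution of Eq. 9.3, restricted to the finitely many
    # positions where the flipped kernel overlaps the input completely
    # ("valid" positions). x and w are 1-D arrays with len(w) <= len(x).
    k = len(w)
    return np.array([
        sum(x[t + k - 1 - a] * w[a] for a in range(k))   # kernel is flipped
        for t in range(len(x) - k + 1)
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical measurements
w = np.array([0.5, 0.3, 0.2])             # hypothetical weighting function

print(conv1d(x, w))                        # [2.3 3.3 4.3]
print(np.convolve(x, w, mode="valid"))     # same result
```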
Finally, we often use convolutions over more than one axis at a time. For
example, if we use a two-dimensional image I as our input, we probably also want
to use a two-dimensional kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).    (9.4)

Convolution is commutative, meaning we can equivalently write:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).    (9.5)

Usually the latter formula is more straightforward to implement in a machine
learning library, because there is less variation in the range of valid values of m
and n.

The commutative property of convolution arises because we have flipped the
kernel relative to the input, in the sense that as m increases, the index into the
input increases, but the index into the kernel decreases. The only reason to flip
the kernel is to obtain the commutative property. While the commutative property
is useful for writing proofs, it is not usually an important property of a neural
network implementation. Instead, many neural network libraries implement a
related function called the cross-correlation, which is the same as convolution but
without flipping the kernel:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).    (9.6)
Many machine learning libraries implement cross-correlation but call it convolution.
In this text we will follow this convention of calling both operations convolution,
and specify whether we mean to flip the kernel or not in contexts where kernel
flipping is relevant. In the context of machine learning, the learning algorithm will
learn the appropriate values of the kernel in the appropriate place, so an algorithm
based on convolution with kernel flipping will learn a kernel that is flipped relative
to the kernel learned by an algorithm without the flipping. It is also rare for
convolution to be used alone in machine learning; instead convolution is used
simultaneously with other functions, and the combination of these functions does
not commute regardless of whether the convolution operation flips its kernel or
not.

See Fig. 9.1 for an example of convolution (without kernel flipping) applied to
a 2-D tensor.
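A small sketch (plain NumPy, with a made-up image and kernel) of the relationship just described: true convolution per Eq. 9.4 equals cross-correlation per Eq. 9.6 with the kernel flipped along both axes.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    # Eq. 9.6: slide the kernel over the image without flipping it,
    # keeping only the "valid" positions where the kernel fits entirely.
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    # Eq. 9.4 restricted to "valid" positions, written out as the
    # flipped sum over kernel coordinates (m, n).
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(kh):
                for n in range(kw):
                    out[i, j] += image[i + kh - 1 - m, j + kw - 1 - n] * kernel[m, n]
    return out

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 5))   # a made-up "image"
K = rng.normal(size=(3, 3))   # a made-up kernel

# Convolution equals cross-correlation with the kernel flipped on both axes.
print(np.allclose(convolve2d(I, K), cross_correlate2d(I, K[::-1, ::-1])))  # True
```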
Discrete convolution can be viewed as multiplication by a matrix. However, the
matrix has several entries constrained to be equal to other entries. For example,
for univariate discrete convolution, each row of the matrix is constrained to be
equal to the row above shifted by one element. This is known as a Toeplitz matrix.
In two dimensions, a doubly block circulant matrix corresponds to convolution.
In addition to these constraints that several elements be equal to each other,
convolution usually corresponds to a very sparse matrix (a matrix whose entries are
mostly equal to zero). This is because the kernel is usually much smaller than the
input image. Any neural network algorithm that works with matrix multiplication
and does not depend on specific properties of the matrix structure should work
with convolution, without requiring any further changes to the neural network.
Typical convolutional neural networks do make use of further specializations in
order to deal with large inputs efficiently, but these are not strictly necessary from
a theoretical perspective.
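To make the matrix view concrete, here is a small sketch (NumPy, with an arbitrary example kernel) that builds the Toeplitz matrix for a univariate "valid" convolution and checks it against np.convolve; note that each row is the row above shifted by one, and most entries are zero:

```python
import numpy as np

def conv_matrix(w, n):
    # Build the (n - k + 1) x n Toeplitz matrix whose product with a
    # length-n input equals "valid" 1-D convolution with kernel w.
    k = len(w)
    M = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        M[r, r:r + k] = w[::-1]   # flipped kernel, per the definition
    return M

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.3, 0.2])

M = conv_matrix(w, len(x))
print(np.allclose(M @ x, np.convolve(x, w, mode="valid")))   # True
```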
[Figure 9.1: an input grid with entries a–l, a 2×2 kernel with entries w, x, y, z, and
the resulting output grid with entries such as aw + bx + ey + fz.]

Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict
the output to only positions where the kernel lies entirely within the image, called "valid"
convolution in some contexts. We draw boxes with arrows to indicate how the upper-left
element of the output tensor is formed by applying the kernel to the corresponding
upper-left region of the input tensor.
9.2 Motivation

Convolution leverages three important ideas that can help improve a machine
learning system: sparse interactions, parameter sharing and equivariant representations.
Moreover, convolution provides a means for working with inputs of variable
size. We now describe each of these ideas in turn.
Traditional neural network layers use matrix multiplication by a matrix of
parameters with a separate parameter describing the interaction between each
input unit and each output unit. This means every output unit interacts with every
input unit. Convolutional networks, however, typically have sparse interactions
(also referred to as sparse connectivity or sparse weights). This is accomplished by
making the kernel smaller than the input. For example, when processing an image,
the input image might have thousands or millions of pixels, but we can detect small,
meaningful features such as edges with kernels that occupy only tens or hundreds of
pixels. This means that we need to store fewer parameters, which both reduces the
memory requirements of the model and improves its statistical efficiency. It also
means that computing the output requires fewer operations. These improvements
in efficiency are usually quite large. If there are m inputs and n outputs, then
matrix multiplication requires m × n parameters and the algorithms used in practice
have O(m × n) runtime (per example). If we limit the number of connections
each output may have to k, then the sparsely connected approach requires only
k × n parameters and O(k × n) runtime. For many practical applications, it is
possible to obtain good performance on the machine learning task while keeping
k several orders of magnitude smaller than m. For graphical demonstrations of
sparse connectivity, see Fig. 9.2 and Fig. 9.3. In a deep convolutional network,
units in the deeper layers may indirectly interact with a larger portion of the input,
as shown in Fig. 9.4. This allows the network to efficiently describe complicated
interactions between many variables by constructing such interactions from simple
building blocks that each describe only sparse interactions.
[Figure 9.2: two panels of units x1 … x5 connected to units s1 … s5.]

Figure 9.2: Sparse connectivity, viewed from below: We highlight one input unit, x3, and
also highlight the output units in s that are affected by this unit. (Top) When s is formed
by convolution with a kernel of width 3, only three outputs are affected by x3. (Bottom)
When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the
outputs are affected by x3.
[Figure 9.3: two panels of units x1 … x5 connected to units s1 … s5.]

Figure 9.3: Sparse connectivity, viewed from above: We highlight one output unit, s3, and
also highlight the input units in x that affect this unit. These units are known as the
receptive field of s3. (Top) When s is formed by convolution with a kernel of width 3, only
three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity
is no longer sparse, so all of the inputs affect s3.
[Figure 9.4: three layers of units x1 … x5, h1 … h5, and g1 … g5.]

Figure 9.4: The receptive field of the units in the deeper layers of a convolutional network
is larger than the receptive field of the units in the shallow layers. This effect increases if
the network includes architectural features like strided convolution (Fig. 9.12) or pooling
(Sec. 9.3). This means that even though direct connections in a convolutional net are very
sparse, units in the deeper layers can be indirectly connected to all or most of the input
image.
[Figure 9.5: two panels of units x1 … x5 connected to units s1 … s5.]

Figure 9.5: Parameter sharing: Black arrows indicate the connections that use a particular
parameter in two different models. (Top) The black arrows indicate uses of the central
element of a 3-element kernel in a convolutional model. Due to parameter sharing, this
single parameter is used at all input locations. (Bottom) The single black arrow indicates
the use of the central element of the weight matrix in a fully connected model. This model
has no parameter sharing so the parameter is used only once.
Parameter sharing refers to using the same parameter for more than one
function in a model. In a traditional neural net, each element of the weight matrix
is used exactly once when computing the output of a layer. It is multiplied by one
element of the input and then never revisited. As a synonym for parameter sharing,
one can say that a network has tied weights, because the value of the weight applied
to one input is tied to the value of a weight applied elsewhere. In a convolutional
neural net, each member of the kernel is used at every position of the input (except
perhaps some of the boundary pixels, depending on the design decisions regarding
the boundary). The parameter sharing used by the convolution operation means
that rather than learning a separate set of parameters for every location, we learn
only one set. This does not affect the runtime of forward propagation—it is still
O(k × n)—but it does further reduce the storage requirements of the model to
k parameters. Recall that k is usually several orders of magnitude less than m.
Since m and n are usually roughly the same size, k is practically insignificant
compared to m × n. Convolution is thus dramatically more efficient than dense
matrix multiplication in terms of the memory requirements and statistical efficiency.
For a graphical depiction of how parameter sharing works, see Fig. 9.5.
As an example of b oth of these first two principles in action, Fig. 9.6 sho shows
ws how
For a graphical depiction of how parameter sharing works, see Fig. 9.5.
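To make the scale of these savings concrete, consider a hypothetical layer with
m = n = 10,000 units and a kernel of width k = 100 (sizes chosen here purely for
illustration). A dense weight matrix would require storing m × n = 10^8 parameters,
while convolution stores only the k = 100 kernel entries and runs in O(k × n) = O(10^6)
operations rather than O(10^8).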
As an example of both of these first two principles in action, Fig. 9.6 shows how
sparse connectivity and parameter sharing can dramatically improve the efficiency
of a linear function for detecting edges in an image.
In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called equivariance to translation. To say a function is
equivariant means that if the input changes, the output changes in the same way.
Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)).
In the case of convolution, if we let g be any function that translates the input,
i.e., shifts it, then the convolution function is equivariant to g. For example, let I
be a function giving image brightness at integer coordinates. Let g be a function
mapping one image function to another image function, such that I′ = g(I) is
the image function with I′(x, y) = I(x − 1, y). This shifts every pixel of I one
unit to the right. If we apply this transformation to I, then apply convolution,
the result will be the same as if we applied convolution to I, then applied the
transformation g to the output. When processing time series data, this means
that convolution produces a sort of timeline that shows when different features
appear in the input. If we move an event later in time in the input, the exact
same representation of it will appear in the output, just later in time. Similarly
with images, convolution creates a 2-D map of where certain features appear in
the input. If we move the object in the input, its representation will move the
same amount in the output. This is useful for when we know that some function
of a small number of neighboring pixels is useful when applied to multiple input
locations. For example, when processing images, it is useful to detect edges in
the first layer of a convolutional network. The same edges appear more or less
everywhere in the image, so it is practical to share parameters across the entire
image. In some cases, we may not wish to share parameters across the entire
image. For example, if we are processing images that are cropped to be centered
on an individual's face, we probably want to extract different features at different
locations: the part of the network processing the top of the face needs to look for
eyebrows, while the part of the network processing the bottom of the face needs to
look for a chin.
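As a quick numerical check of this equivariance property, the following sketch (our
own illustration, not from the text, using NumPy in 1-D) confirms that shifting a signal
and then convolving gives the same result as convolving and then shifting:

    import numpy as np

    def g(v):
        # Translate the input one step to the right, zero-filling the edge.
        out = np.zeros_like(v)
        out[1:] = v[:-1]
        return out

    def f(v):
        # A small convolution playing the role of the layer; the kernel
        # values here are arbitrary.
        return np.convolve(v, np.array([1.0, -1.0]), mode="same")

    x = np.array([0.0, 1.0, 3.0, 2.0, 5.0, 4.0])
    print(f(g(x)))  # [ 0.  0.  1.  2. -1.  3.]
    print(g(f(x)))  # same values: f(g(x)) = g(f(x)) away from the boundary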
Convolution is not naturally equivariant to some other transformations, such
as changes in the scale or rotation of an image. Other mechanisms are necessary
for handling these kinds of transformations.

Finally, some kinds of data cannot be processed by neural networks defined by
matrix multiplication with a fixed-shape matrix. Convolution enables processing
of some of these kinds of data. We discuss this further in Sec. 9.7.
9.3 Pooling

A typical layer of a convolutional network consists of three stages (see Fig. 9.7). In
the first stage, the layer performs several convolutions in parallel to produce a set
of linear activations. In the second stage, each linear activation is run through a
nonlinear activation function, such as the rectified linear activation function. This
stage is sometimes called the detector stage. In the third stage, we use a pooling
function to modify the output of the layer further.

A pooling function replaces the output of the net at a certain location with
a summary statistic of the nearby outputs. For example, the max pooling (Zhou
and Chellappa, 1988) operation reports the maximum output within a rectangular
neighborhood.
Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking
each pixel in the original image and subtracting the value of its neighboring pixel on the
left. This shows the strength of all of the vertically oriented edges in the input image,
which can be a useful operation for object detection. Both images are 280 pixels tall.
The input image is 320 pixels wide while the output image is 319 pixels wide. This
transformation can be described by a convolution kernel containing two elements, and
requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and
one addition per output pixel) to compute using convolution. To describe the same
transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over
eight billion, entries in the matrix, making convolution four billion times more efficient for
representing this transformation. The straightforward matrix multiplication algorithm
performs over sixteen billion floating point operations, making convolution roughly 60,000
times more efficient computationally. Of course, most of the entries of the matrix would be
zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication
and convolution would require the same number of floating point operations to compute.
The matrix would still need to contain 2 × 319 × 280 = 178,640 entries. Convolution
is an extremely efficient way of describing transformations that apply the same linear
transformation of a small, local region across the entire input. (Photo credit: Paula
Goodfellow)
[Figure 9.7 diagram. Left, complex layer terminology: input to layer → convolution stage (affine transform) → detector stage (nonlinearity, e.g., rectified linear) → pooling stage → next layer, all grouped as one convolutional layer. Right, simple layer terminology: input to layer → convolution layer (affine transform) → detector layer (nonlinearity, e.g., rectified linear) → pooling layer → next layer.]
Figure 9.7: The components of a typical convolutional neural network layer. There are two
commonly used sets of terminology for describing these layers. (Left) In this terminology,
the convolutional net is viewed as a small number of relatively complex layers, with each
layer having many "stages." In this terminology, there is a one-to-one mapping between
kernel tensors and network layers. In this book we generally use this terminology. (Right)
In this terminology, the convolutional net is viewed as a larger number of simple layers;
every step of processing is regarded as a layer in its own right. This means that not every
"layer" has parameters.


Other popular pooling functions include the average of a rectangular
neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average
based on the distance from the central pixel.

In all cases, pooling helps to make the representation become approximately
invariant to small translations of the input. Invariance to translation means that if
we translate the input by a small amount, the values of most of the pooled outputs
do not change. See Fig. 9.8 for an example of how this works. Invariance to
local translation can be a very useful property if we care more about
whether some feature is present than exactly where it is. For example,
when determining whether an image contains a face, we need not know the location
of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on
the left side of the face and an eye on the right side of the face. In other contexts,
it is more important to preserve the location of a feature. For example, if we want
to find a corner defined by two edges meeting at a specific orientation, we need to
preserve the location of the edges well enough to test whether they meet.
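The following sketch (our own illustration in NumPy, mirroring the 1-D setting of
Fig. 9.8) shows this effect directly: after shifting the detector outputs by one position,
every detector value moves, but only some of the max-pooled values change:

    import numpy as np

    def max_pool_1d(x, width=3):
        # Max pooling with a stride of one pixel: report the maximum
        # within each window of `width` consecutive detector outputs.
        return np.array([x[i:i + width].max()
                         for i in range(len(x) - width + 1)])

    detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
    shifted = np.roll(detector, 1)        # translate the input one step

    print(max_pool_1d(detector))          # [1.  1.  0.2 0.1]
    print(max_pool_1d(shifted))           # [1.  1.  1.  0.2]

Half of the pooled outputs are unchanged, even though every detector output moved.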
The use of pooling can be viewed as adding an infinitely strong prior that
the function the layer learns must be invariant to small translations. When this
assumption is correct, it can greatly improve the statistical efficiency of the network.

Pooling over spatial regions produces invariance to translation, but if we pool
over the outputs of separately parametrized convolutions, the features can learn
which transformations to become invariant to (see Fig. 9.9).

Because pooling summarizes the responses over a whole neighborhood, it is
possible to use fewer pooling units than detector units, by reporting summary
statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See
Fig. 9.10 for an example. This improves the computational efficiency of the network
because the next layer has roughly k times fewer inputs to process. When the
number of parameters in the next layer is a function of its input size (such as
when the next layer is fully connected and based on matrix multiplication) this
reduction in the input size can also result in improved statistical efficiency and
reduced memory requirements for storing the parameters.

For many tasks, pooling is essential for handling inputs of varying size. For
example, if we want to classify images of variable size, the input to the classification
layer must have a fixed size. This is usually accomplished by varying the size of an
offset between pooling regions so that the classification layer always receives the
same number of summary statistics regardless of the input size. For example, the
final pooling layer of the network may be defined to output four sets of summary
statistics, one for each quadrant of an image, regardless of the image size.
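A minimal sketch of this idea (our own illustration, assuming max pooling and a single
feature map): pooling over each quadrant of a variable-sized map yields a fixed-length
vector for the classifier, no matter the input size:

    import numpy as np

    def quadrant_max_pool(feature_map):
        # Pool over each quadrant of a 2-D feature map, so the result
        # always has four entries regardless of the input's height or width.
        h, w = feature_map.shape
        r, c = h // 2, w // 2
        quadrants = [feature_map[:r, :c], feature_map[:r, c:],
                     feature_map[r:, :c], feature_map[r:, c:]]
        return np.array([q.max() for q in quadrants])

    print(quadrant_max_pool(np.random.rand(9, 13)).shape)   # (4,)
    print(quadrant_max_pool(np.random.rand(32, 32)).shape)  # (4,)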

[Figure 9.8 diagram. Top: detector stage ... 0.1 1. 0.2 0.1 ..., pooling stage ... 1. 1. 1. 0.2 .... Bottom (input shifted right by one pixel): detector stage ... 0.3 0.1 1. 0.2 ..., pooling stage ... 0.3 1. 1. 1. ....]

Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output
of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top
row shows the outputs of max pooling, with a stride of one pixel between pooling regions
and a pooling region width of three pixels. (Bottom) A view of the same network, after
the input has been shifted to the right by one pixel. Every value in the bottom row has
changed, but only half of the values in the top row have changed, because the max pooling
units are only sensitive to the maximum value in the neighborhood, not its exact location.


[Figure 9.9 diagram: three detector units feeding one pooling unit; for two different inputs, a different detector unit (unit 1 vs. unit 3) produces the large response, but the pooling unit has a large response either way.]

Figure 9.9: Example of learned invariances: A pooling unit that pools over multiple features
that are learned with separate parameters can learn to be invariant to transformations of
the input. Here we show how a set of three learned filters and a max pooling unit can learn
to become invariant to rotation. All three filters are intended to detect a hand-written 5.
Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in
the input, the corresponding filter will match it and cause a large activation in a detector
unit. The max pooling unit then has a large activation regardless of which pooling unit
was activated. We show here how the network processes two different inputs, resulting
in two different detector units being activated. The effect on the pooling unit is roughly
the same either way. This principle is leveraged by maxout networks (Goodfellow et al.,
2013a) and other convolutional networks. Max pooling over spatial positions is naturally
invariant to translation; this multi-channel approach is only necessary for learning other
transformations.

[Figure 9.10 diagram: detector values 0.1 1. 0.2 0.1 0.0 0.1 pooled down to 1. 0.2 0.1.]

Figure 9.10: Pooling with downsampling. Here we use max-pooling with a pool width of
three and a stride between pools of two. This reduces the representation size by a factor
of two, which reduces the computational and statistical burden on the next layer. Note
that the rightmost pooling region has a smaller size, but must be included if we do not
want to ignore some of the detector units.

Some theoretical work gives guidance as to which kinds of pooling one should
use in various situations (Boureau et al., 2010). It is also possible to dynamically
pool features together, for example, by running a clustering algorithm on the
locations of interesting features (Boureau et al., 2011). This approach yields a
different set of pooling regions for each image. Another approach is to learn a
single pooling structure that is then applied to all images (Jia et al., 2012).

Pooling can complicate some kinds of neural network architectures that use
top-down information, such as Boltzmann machines and autoencoders. These
issues will be discussed further when we present these types of networks in Part
III. Pooling in convolutional Boltzmann machines is presented in Sec. 20.6. The
inverse-like operations on pooling units needed in some differentiable networks will
be covered in Sec. 20.10.6.

Some examples of complete convolutional network architectures for classification
using convolution and pooling are shown in Fig. 9.11.
9.4 Convolution and Pooling as an Infinitely Strong Prior

Recall the concept of a prior probability distribution from Sec. 5.2. This is a
probability distribution over the parameters of a model that encodes our beliefs
about what models are reasonable, before we have seen any data.

Priors can be considered weak or strong depending on how concentrated the
probability density in the prior is. A weak prior is a prior distribution with high
entropy, such as a Gaussian distribution with high variance. Such a prior allows
the data to move the parameters more or less freely. A strong prior has very low
entropy, such as a Gaussian distribution with low variance. Such a prior plays a
more active role in determining where the parameters end up.

An infinitely strong prior places zero probability on some parameters and says
that these parameter values are completely forbidden, regardless of how much
support the data gives to those values.

We can imagine a convolutional net as being similar to a fully connected net,
but with an infinitely strong prior over its weights. This infinitely strong prior
says that the weights for one hidden unit must be identical to the weights of its
neighbor, but shifted in space. The prior also says that the weights must be zero,
except for in the small, spatially contiguous receptive field assigned to that hidden
unit. Overall, we can think of the use of convolution as introducing an infinitely
strong prior probability distribution over the parameters of a layer. This prior
says that the function the layer should learn contains only local interactions and is
equivariant to translation. Likewise, the use of pooling is an infinitely strong prior
that each unit should be invariant to small translations.
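This view can be made concrete in a few lines (our own sketch, in NumPy, for a 1-D
layer with a hypothetical three-element kernel): the convolutional layer is exactly a fully
connected layer whose weight matrix is constrained to be sparse, with the same values
repeated along each row, shifted in space:

    import numpy as np

    k = np.array([1.0, 2.0, -1.0])     # kernel values chosen arbitrarily
    n = 6                              # number of input units
    rows = n - len(k) + 1              # one output unit per valid position

    W = np.zeros((rows, n))            # the "fully connected" weight matrix
    for i in range(rows):
        W[i, i:i + len(k)] = k         # identical weights, shifted in space

    x = np.random.randn(n)
    # np.convolve flips the kernel, so compare against the flipped kernel.
    assert np.allclose(W @ x, np.convolve(x, k[::-1], mode="valid"))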
[Figure 9.11 diagram: three columns, each an architecture read bottom to top. Left: input image 256x256x3 → convolution+ReLU, 256x256x64 → pooling with stride 4, 64x64x64 → convolution+ReLU, 64x64x64 → pooling with stride 4, 16x16x64 → reshape to vector, 16,384 units → matrix multiply, 1,000 units → softmax over 1,000 class probabilities. Center: same first layers, then pooling to a 3x3 grid, 3x3x64 → reshape to vector, 576 units → matrix multiply, 1,000 units → softmax. Right: same first layers through pooling with stride 4, 16x16x64, then convolution, 16x16x1,000 → average pooling, 1x1x1,000 → softmax.]

Figure 9.11: Examples of architectures for classification with convolutional networks. The
specific strides and depths used in this figure are not advisable for real use; they are
designed to be very shallow in order to fit onto the page. Real convolutional networks
also often involve significant amounts of branching, unlike the chain structures used
here for simplicity. (Left) A convolutional network that processes a fixed image size.
After alternating between convolution and pooling for a few layers, the tensor for the
convolutional feature map is reshaped to flatten out the spatial dimensions. The rest
of the network is an ordinary feedforward network classifier, as described in Chapter 6.
(Center) A convolutional network that processes a variable-sized image, but still maintains
a fully connected section. This network uses a pooling operation with variably-sized pools
but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the
fully connected portion of the network. (Right) A convolutional network that does not
have any fully connected weight layer. Instead, the last convolutional layer outputs one
feature map per class. The model presumably learns a map of how likely each class is to
occur at each spatial location. Averaging a feature map down to a single value provides
the argument to the softmax classifier at the top.
Of course, implementing a convolutional net as a fully connected net with an
infinitely strong prior would be extremely computationally wasteful. But thinking
of a convolutional net as a fully connected net with an infinitely strong prior can
give us some insights into how convolutional nets work.

One key insight is that convolution and pooling can cause underfitting. Like
any prior, convolution and pooling are only useful when the assumptions made
by the prior are reasonably accurate. If a task relies on preserving precise spatial
information, then using pooling on all features can increase the training error.
Some convolutional network architectures (Szegedy et al., 2014a) are designed to
use pooling on some channels but not on other channels, in order to get both
highly invariant features and features that will not underfit when the translation
invariance prior is incorrect. When a task involves incorporating information from
very distant locations in the input, then the prior imposed by convolution may be
inappropriate.

Another key insight from this view is that we should only compare convolu-
tional models to other convolutional models in benchmarks of statistical learning
performance. Models that do not use convolution would be able to learn even if
we permuted all of the pixels in the image. For many image datasets, there are
separate benchmarks for models that are permutation invariant and must discover
the concept of topology via learning, and models that have the knowledge of spatial
relationships hard-coded into them by their designer.
9.5 Variants of the Basic Convolution Function

When discussing convolution in the context of neural networks, we usually do
not refer exactly to the standard discrete convolution operation as it is usually
understood in the mathematical literature. The functions used in practice differ
slightly. Here we describe these differences in detail, and highlight some useful
properties of the functions used in neural networks.

First, when we refer to convolution in the context of neural networks, we usually
actually mean an operation that consists of many applications of convolution in
parallel. This is because convolution with a single kernel can only extract one kind
of feature, albeit at many spatial locations. Usually we want each layer of our
network to extract many kinds of features, at many locations.

Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green
and blue intensity at each pixel. In a multilayer convolutional network, the input
to the second layer is the output of the first layer, which usually has the output
of many different convolutions at each position. When working with images, we
usually think of the input and output of the convolution as being 3-D tensors, with
one index into the different channels and two indices into the spatial coordinates
of each channel. Software implementations usually work in batch mode, so they
will actually use 4-D tensors, with the fourth axis indexing different examples in
the batch, but we will omit the batch axis in our description here for simplicity.

Because convolutional networks usually use multi-channel convolution, the
linear operations they are based on are not guaranteed to be commutative, even if
kernel-flipping is used. These multi-channel operations are only commutative if
each operation has the same number of output channels as input channels.
Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection
strength between a unit in channel i of the output and a unit in channel j of the
input, with an offset of k rows and l columns between the output unit and the
input unit. Assume our input consists of observed data V with element V_{i,j,k} giving
the value of the input unit within channel i at row j and column k. Assume our
output consists of Z with the same format as V. If Z is produced by convolving K
across V without flipping K, then

    Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n},                    (9.7)

where the summation over l, m and n is over all values for which the tensor indexing
operations inside the summation are valid. In linear algebra notation, we index into
arrays using a 1 for the first entry. This necessitates the −1 in the above formula.
Programming languages such as C and Python index starting from 0, rendering
the above expression even simpler.
We may want to skip over some positions of the kernel in order to reduce the
computational cost (at the expense of not extracting our features as finely). We
can think of this as downsampling the output of the full convolution function. If
we want to sample only every s pixels in each direction in the output, then we can
define a downsampled convolution function c such that

    Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} V_{l, (j-1)×s+m, (k-1)×s+n} K_{i,l,m,n}.    (9.8)

We refer to s as the stride of this downsampled convolution. It is also possible
to define a separate stride for each direction of motion. See Fig. 9.12 for an
illustration.
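A direct, loop-based transcription of Eqs. 9.7 and 9.8 (our own sketch in NumPy, using
0-based indexing so the −1 terms disappear; far too slow for real use, but it makes the
index bookkeeping explicit):

    import numpy as np

    def conv(K, V, s=1):
        # K: kernel, shape (output channels, input channels, kh, kw)
        # V: input, shape (input channels, height, width)
        # s: stride; s=1 recovers Eq. 9.7, s>1 gives Eq. 9.8
        oc, ic, kh, kw = K.shape
        _, H, W = V.shape
        Z = np.zeros((oc, (H - kh) // s + 1, (W - kw) // s + 1))
        for i in range(Z.shape[0]):              # output channel
            for j in range(Z.shape[1]):          # output row
                for k in range(Z.shape[2]):      # output column
                    for l in range(ic):          # input channel
                        for m in range(kh):      # row offset
                            for n in range(kw):  # column offset
                                Z[i, j, k] += V[l, j*s + m, k*s + n] * K[i, l, m, n]
        return Z

    K = np.random.randn(2, 3, 3, 3)   # 2 output channels, 3 input channels
    V = np.random.randn(3, 8, 8)      # e.g., a small RGB image
    print(conv(K, V).shape)           # (2, 6, 6)
    print(conv(K, V, s=2).shape)      # (2, 3, 3)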
[Figure 9.12 diagram. Top: strided convolution maps x1–x5 directly to s1–s3. Bottom: unit-stride convolution maps x1–x5 to z1–z5, followed by downsampling to s1–s3.]

Figure 9.12: Convolution with a stride. In this example, we use a stride of two. (Top)
Convolution with a stride length of two implemented in a single operation. (Bottom)
Convolution with a stride greater than one pixel is mathematically equivalent to convolution
with unit stride followed by downsampling. Obviously, the two-step approach involving
downsampling is computationally wasteful, because it computes many values that are
then discarded.


One essential feature of any convolutional network implementation is the ability
to implicitly zero-pad the input V in order to make it wider. Without this feature,
the width of the representation shrinks by one pixel less than the kernel width
at each layer. Zero padding the input allows us to control the kernel width and
the size of the output independently. Without zero padding, we are forced to
choose between shrinking the spatial extent of the network rapidly and using small
kernels; both scenarios significantly limit the expressive power of the network.
See Fig. 9.13 for an example.
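This shrinkage, and how padding counteracts it, is easy to check with NumPy's one-
dimensional convolve routine, whose mode argument controls the amount of implicit
zero padding (a sketch; the three modes correspond to the three special cases discussed
next):

    import numpy as np

    x = np.ones(16)   # input of width m = 16
    k = np.ones(6)    # kernel of width k = 6

    print(len(np.convolve(x, k, mode="valid")))  # 11 = m - k + 1 (no padding)
    print(len(np.convolve(x, k, mode="same")))   # 16 = m (size preserved)
    print(len(np.convolve(x, k, mode="full")))   # 21 = m + k - 1 (maximal padding)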
Three special cases of the zero-padding setting are worth mentioning. One is
the extreme case in which no zero-padding is used whatsoever, and the convolution
kernel is only allowed to visit positions where the entire kernel is contained entirely
within the image. In MATLAB terminology, this is called valid convolution. In
this case, all pixels in the output are a function of the same number of pixels in
the input, so the behavior of an output pixel is somewhat more regular. However,
the size of the output shrinks at each layer. If the input image has width m and
the kernel has width k, the output will be of width m − k + 1. The rate of this
shrinkage can be dramatic if the kernels used are large. Since the shrinkage is
greater than 0, it limits the number of convolutional layers that can be included
in the network. As layers are added, the spatial dimension of the network will
eventually drop to 1 × 1, at which point additional layers cannot meaningfully
be considered convolutional. Another special case of the zero-padding setting is
when just enough zero-padding is added to keep the size of the output equal to the
size of the input. MATLAB calls this same convolution. In this case, the network
can contain as many convolutional layers as the available hardware can support,
since the operation of convolution does not modify the architectural possibilities
available to the next layer. However, the input pixels near the border influence
fewer output pixels than the input pixels near the center. This can make the
border pixels somewhat underrepresented in the model. This motivates the other
extreme case, which MATLAB refers to as full convolution, in which enough zeroes
are added for every pixel to be visited k times in each direction, resulting in an
output image of width m + k − 1. In this case, the output pixels near the border
are a function of fewer pixels than the output pixels near the center. This can
make it difficult to learn a single kernel that performs well at all positions in
the convolutional feature map. Usually the optimal amount of zero padding (in
terms of test set classification accuracy) lies somewhere between "valid" and "same"
convolution.

In some cases, we do not actually want to use convolution, but rather locally
connected layers (LeCun, 1986, 1989). In this case, the adjacency matrix in the
graph of our MLP is the same, but every connection has its own weight, specified
[Figure 9.13 diagram: dot patterns showing two networks, one shrinking at each layer and one kept at constant width by zero padding.]

Figure 9.13: The effect of zero padding on network size: Consider a convolutional network
with a kernel of width six at every layer. In this example, we do not use any pooling, so
only the convolution operation itself shrinks the network size. (Top) In this convolutional
network, we do not use any implicit zero padding. This causes the representation to
shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only
able to have three convolutional layers, and the last layer does not ever move the kernel,
so arguably only two of the layers are truly convolutional. The rate of shrinking can
be mitigated by using smaller kernels, but smaller kernels are less expressive and some
shrinking is inevitable in this kind of architecture. (Bottom) By adding five implicit zeroes
to each layer, we prevent the representation from shrinking with depth. This allows us to
make an arbitrarily deep convolutional network.

The indices into W are respectively: i, the output channel, j, the output row, k, the output column, l, the input channel, m, the row offset within the input, and n, the column offset within the input. The linear part of a locally connected layer is then given by

    Z_{i,j,k} = \sum_{l,m,n} [ V_{l,\, j+m-1,\, k+n-1} \, w_{i,j,k,l,m,n} ].        (9.9)

This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel, but without sharing parameters across locations. Fig. 9.14 compares local connections, convolution, and full connections.
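To make Eq. 9.9 concrete, the following is a minimal loop-based sketch (our own names and 0-based indices, favoring clarity over speed) of the linear part of a locally connected layer.

```python
import numpy as np

def locally_connected(V, W):
    """Linear part of a locally connected layer (Eq. 9.9, 0-based).

    V: input features, shape (L, rows, cols)             (channel, row, col)
    W: unshared weights, shape (I, out_r, out_c, L, m, n)
    Returns Z of shape (I, out_r, out_c).
    """
    I, out_r, out_c, L, m, n = W.shape
    Z = np.zeros((I, out_r, out_c))
    for i in range(I):
        for j in range(out_r):
            for k in range(out_c):
                # Each output location (j, k) has its own weight patch:
                # no parameter sharing across space.
                patch = V[:, j:j + m, k:k + n]
                Z[i, j, k] = np.sum(patch * W[i, j, k])
    return Z

V = np.random.randn(3, 8, 8)
W = np.random.randn(4, 6, 6, 3, 3, 3)   # valid-convolution geometry, unshared
print(locally_connected(V, W).shape)     # (4, 6, 6)
```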
Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space. For example, if we want to tell if an image is a picture of a face, we only need to look for the mouth in the bottom half of the image.
It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, for example to constrain each output channel i to be a function of only a subset of the input channels l. A common way to do this is to make the first m output channels connect to only the first n input channels, the second m output channels connect to only the second n input channels, and so on. See Fig. 9.15 for an example. Modeling interactions between few channels allows the network to have fewer parameters, in order to reduce memory consumption and increase statistical efficiency, and also reduces the amount of computation needed to perform forward and back-propagation. It accomplishes these goals without reducing the number of hidden units.

Tiled convolution (Gregor and LeCun, 2010a; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer. Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we rotate through as we move through space. This means that immediately neighboring locations will have different filters, like in a locally connected layer, but the memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels, rather than the size of the entire output feature map. See Fig. 9.16 for a comparison of locally connected layers, tiled convolution, and standard convolution.
To define tiled convolution algebraically, let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction.
Figure 9.14: Comparison of local connections, convolution, and full connections. (Top) A locally connected layer with a patch size of two pixels. Each edge is labeled with a unique letter to show that each edge is associated with its own weight parameter. (Center) A convolutional layer with a kernel width of two pixels. This model has exactly the same connectivity as the locally connected layer. The difference lies not in which units interact with each other, but in how the parameters are shared. The locally connected layer has no parameter sharing. The convolutional layer uses the same two weights repeatedly across the entire input, as indicated by the repetition of the letters labeling each edge. (Bottom) A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter (there are too many to label explicitly with letters in this diagram). However, it does not have the restricted connectivity of the locally connected layer.

[Figure 9.15 diagram: an input tensor and an output tensor, with channel coordinates and spatial coordinates labeled on each.]
Figure 9.15: A convolutional network with the first two output channels connected to only the first two input channels, and the second two output channels connected to only the second two input channels.
Figure 9.16: A comparison of locally connected layers, tiled convolution, and standard convolution. All three have the same sets of connections between units, when the same size of kernel is used. This diagram illustrates the use of a kernel that is two pixels wide. The differences between the methods lie in how they share parameters. (Top) A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labeling each connection with a unique letter. (Center) Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has edges labeled “a” and “b,” while the other has edges labeled “c” and “d.” Each time we move one pixel to the right in the output, we move on to using a different kernel. This means that, like the locally connected layer, neighboring units in the output have different parameters. Unlike the locally connected layer, after we have gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a multiple of t steps, then they share parameters. (Bottom) Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel and it is applied everywhere, as indicated in the diagram by using the kernel with weights labeled “a” and “b” everywhere.


If t is equal to the output width, this is the same as a locally connected layer.

    Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m-1,\, k+n-1} \, K_{i,l,m,n,\, j\%t+1,\, k\%t+1},        (9.10)

where % is the modulo operation, with t % t = 0, (t + 1) % t = 1, etc. It is straightforward to generalize this equation to use a different tiling range for each dimension.

Both locally connected layers and tiled convolutional layers have an interesting interaction with max-pooling: the detector units of these layers are driven by different filters. If these filters learn to detect different transformed versions of the same underlying features, then the max-pooled units become invariant to the learned transformation (see Fig. 9.9). Convolutional layers are hard-coded to be invariant specifically to translation.

Other operations besides convolution are usually necessary to implement a convolutional network. To perform learning, one must be able to compute the gradient with respect to the kernel, given the gradient with respect to the outputs. In some simple cases, this operation can be performed using the convolution operation, but many cases of interest, including the case of stride greater than 1, do not have this property.

Recall that convolution is a linear operation and can thus be described as a matrix multiplication (if we first reshape the input tensor into a flat vector). The matrix involved is a function of the convolution kernel. The matrix is sparse and each element of the kernel is copied to several elements of the matrix. This view helps us to derive some of the other operations needed to implement a convolutional network.

Multiplication by the transpose of the matrix defined by convolution is one such operation. This is the operation needed to back-propagate error derivatives through a convolutional layer, so it is needed to train convolutional networks that have more than one hidden layer. This same operation is also needed if we wish to reconstruct the visible units from the hidden units (Simard et al., 1992). Reconstructing the visible units is an operation commonly used in the models described in Part III of this book, such as autoencoders, RBMs, and sparse coding. Transpose convolution is necessary to construct convolutional versions of those models. Like the kernel gradient operation, this input gradient operation can be implemented using a convolution in some cases, but in the general case requires a third operation to be implemented. Care must be taken to coordinate this transpose operation with the forward propagation. The size of the output that the transpose operation should return depends on the zero padding policy and stride of the forward propagation operation, as well as the size of the forward propagation's output map. In some cases, multiple sizes of input to forward propagation can result in the same size of output map, so the transpose operation must be explicitly told what the size of the original input was.
These three operations—convolution, backprop from output to weights, and backprop from output to inputs—are sufficient to compute all of the gradients needed to train any depth of feedforward convolutional network, as well as to train convolutional networks with reconstruction functions based on the transpose of convolution. See Goodfellow (2010) for a full derivation of the equations in the fully general multi-dimensional, multi-example case. To give a sense of how these equations work, we present the two dimensional, single example version here.
Suppose we want to train a convolutional network that incorporates strided convolution of kernel stack K applied to multi-channel image V with stride s as defined by c(K, V, s) as in Eq. 9.8. Suppose we want to minimize some loss function J(V, K). During forward propagation, we will need to use c itself to output Z, which is then propagated through the rest of the network and used to compute the cost function J. During back-propagation, we will receive a tensor G such that G_{i,j,k} = \partial J(V, K) / \partial Z_{i,j,k}.

To train the network, we need to compute the derivatives with respect to the weights in the kernel. To do so, we can use a function

    g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n} V_{j,\, (m-1)\times s+k,\, (n-1)\times s+l}.        (9.11)
If this layer is not the bottom layer of the network, we will need to compute the gradient with respect to V in order to back-propagate the error farther down. To do so, we can use a function

    h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K)        (9.12)
                       = \sum_{l,m \;\text{s.t.}\; (l-1)\times s+m=j} \; \sum_{n,p \;\text{s.t.}\; (n-1)\times s+p=k} \; \sum_q K_{q,i,m,p} \, G_{q,l,n}.        (9.13)
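A minimal sketch (ours) of the forward function c of Eq. 9.8 and the kernel-gradient function g of Eq. 9.11, with the book's 1-based indices converted to 0-based array indices; loops are used for clarity rather than speed.

```python
import numpy as np

def c(K, V, s):
    """Strided convolution of Eq. 9.8 (0-based indices).
    K: kernel stack, shape (I, L, kr, kc); V: image, shape (L, R, C)."""
    I, L, kr, kc = K.shape
    _, R, C = V.shape
    out_r, out_c = (R - kr) // s + 1, (C - kc) // s + 1
    Z = np.zeros((I, out_r, out_c))
    for i in range(I):
        for j in range(out_r):
            for k in range(out_c):
                Z[i, j, k] = np.sum(V[:, j*s:j*s + kr, k*s:k*s + kc] * K[i])
    return Z

def g(G, V, s, kshape):
    """Kernel gradient of Eq. 9.11: dJ/dK, given G = dJ/dZ."""
    I, L, kr, kc = kshape
    dK = np.zeros(kshape)
    _, out_r, out_c = G.shape
    for i in range(I):
        for m in range(out_r):
            for n in range(out_c):
                # Each output unit (i, m, n) saw the input patch below,
                # so its gradient is scattered back onto the whole kernel.
                dK[i] += G[i, m, n] * V[:, m*s:m*s + kr, n*s:n*s + kc]
    return dK

V = np.random.randn(2, 9, 9)
K = np.random.randn(4, 2, 3, 3)
Z = c(K, V, s=2)
G = np.random.randn(*Z.shape)                # stand-in gradient from the net
print(Z.shape, g(G, V, 2, K.shape).shape)    # (4, 4, 4) (4, 2, 3, 3)
```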

Autoencoder networks, described in Chapter 14, are feedforward networks trained to copy their input to their output. A simple example is the PCA algorithm, that copies its input x to an approximate reconstruction r using the function W^T W x. It is common for more general autoencoders to use multiplication by the transpose of the weight matrix just as PCA does.
To make such models convolutional, we can use the function h to perform the transpose of the convolution operation. Suppose we have hidden units H in the same format as Z and we define a reconstruction

    R = h(K, H, s).        (9.14)

In order to train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the back-propagation algorithm on any standard network architectures.
Generally, we do not use only a linear operation in order to transform from the inputs to the outputs in a convolutional layer. We generally also add some bias term to each output before applying the nonlinearity. This raises the question of how to share parameters among the biases. For locally connected layers it is natural to give each unit its own bias, and for tiled convolution, it is natural to share the biases with the same tiling pattern as the kernels. For convolutional layers, it is typical to have one bias per channel of the output and share it across all locations within each convolution map. However, if the input is of known, fixed size, it is also possible to learn a separate bias at each location of the output map. Separating the biases may slightly reduce the statistical efficiency of the model, but also allows the model to correct for differences in the image statistics at different locations. For example, when using implicit zero padding, detector units at the edge of the image receive less total input and may need larger biases.
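A small sketch (ours) of the three bias-sharing schemes just described, for an output tensor of shape (channel, row, column); NumPy broadcasting implements the sharing.

```python
import numpy as np

Z = np.random.randn(4, 8, 8)             # (output channel, row, col)

# Convolutional: one bias per channel, shared across all locations.
b_conv = np.random.randn(4, 1, 1)

# Locally connected: every output unit has its own bias.
b_local = np.random.randn(4, 8, 8)

# Tiled (t = 2 in each direction): biases shared with the tiling pattern.
b_tile = np.random.randn(4, 2, 2)
b_tiled = np.tile(b_tile, (1, 4, 4))     # repeat the 2x2 pattern over space

for b in (b_conv, b_local, b_tiled):
    print((Z + b).shape)                 # (4, 8, 8) in every case
```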
9.6 Structured Outputs

Convolutional networks can be used to output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task. Typically this object is just a tensor, emitted by a standard convolutional layer. For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i. This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects.
One issue that often comes up is that the output plane can be smaller than the input plane, as shown in Fig. 9.13. In the kinds of architectures typically used for classification of a single object in an image, the greatest reduction in the spatial dimensions of the network comes from using pooling layers with large stride.
Figure 9.17: An example of a recurrent convolutional network for pixel labeling. The input is an image tensor X, with axes corresponding to image rows, image columns, and channels (red, green, blue). The goal is to output a tensor of labels Ŷ, with a probability distribution over labels for each pixel. This tensor has axes corresponding to image rows, image columns, and the different classes. Rather than outputting Ŷ in a single shot, the recurrent network iteratively refines its estimate Ŷ by using a previous estimate of Ŷ as input for creating a new estimate. The same parameters are used for each updated estimate, and the estimate can be refined as many times as we wish. The tensor of convolution kernels U is used on each step to compute the hidden representation given the input image. The kernel tensor V is used to produce an estimate of the labels given the hidden values. On all but the first step, the kernels W are convolved over Ŷ to provide input to the hidden layer. On the first time step, this term is replaced by zero. Because the same parameters are used on each step, this is an example of a recurrent network, as described in Chapter 10.
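The computation in Fig. 9.17 can be sketched as follows. This is our own toy version: to keep it short we use 1 × 1 kernels, so each “convolution” with U, V, or W reduces to a matrix product over the channel axis.

```python
import numpy as np

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
R, C, n_chan, n_hid, n_cls = 8, 8, 3, 5, 4
X = rng.normal(size=(n_chan, R, C))       # input image tensor

# 1x1 kernels: each convolution is a matrix product over channels.
U = rng.normal(size=(n_hid, n_chan))      # input -> hidden
V = rng.normal(size=(n_cls, n_hid))       # hidden -> label estimate
W = rng.normal(size=(n_hid, n_cls))       # previous estimate -> hidden

Y = np.zeros((n_cls, R, C))               # first step: the W term is zero
for step in range(3):                     # same parameters reused each step
    H = np.tanh(np.einsum('hc,cij->hij', U, X) +
                np.einsum('hk,kij->hij', W, Y))
    Y = softmax(np.einsum('kh,hij->kij', V, H), axis=0)

print(Y.shape, Y.sum(axis=0)[0, 0])       # (4, 8, 8); each pixel sums to ~1
```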
In order to produce an output map of similar size as the input, one can avoid pooling altogether (Jain et al., 2007). Another strategy is to simply emit a lower-resolution grid of labels (Pinheiro and Collobert, 2014, 2015). Finally, in principle, one could use a pooling operator with unit stride.
One strategy for pixel-wise labeling of images is to produce an initial guess of the image labels, then refine this initial guess using the interactions between neighboring pixels. Repeating this refinement step several times corresponds to using the same convolutions at each stage, sharing weights between the last layers of the deep net (Jain et al., 2007). This makes the sequence of computations performed by the successive convolutional layers with weights shared across layers a particular kind of recurrent network (Pinheiro and Collobert, 2014, 2015). Fig. 9.17 shows the architecture of such a recurrent convolutional network.
Once a prediction for each pixel is made, various methods can be used to further process these predictions in order to obtain a segmentation of the image into regions (Briggman et al., 2009; Turaga et al., 2010; Farabet et al., 2013).
The general idea is to assume that large groups of contiguous pixels tend to be associated with the same label. Graphical models can describe the probabilistic relationships between neighboring pixels. Alternatively, the convolutional network can be trained to maximize an approximation of the graphical model training objective (Ning et al., 2005; Thompson et al., 2014).
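To make the tensor output concrete, here is a minimal sketch (ours) of the per-pixel probability tensor S from the beginning of this section: raw scores from a final convolutional layer are normalized with a softmax over the class axis, and the per-pixel argmax gives a hard segmentation.

```python
import numpy as np

n_cls, R, C = 3, 6, 6
scores = np.random.randn(n_cls, R, C)    # raw outputs of the last conv layer

# S[i, j, k]: probability that pixel (j, k) belongs to class i.
e = np.exp(scores - scores.max(axis=0, keepdims=True))
S = e / e.sum(axis=0, keepdims=True)

mask = S.argmax(axis=0)                  # one label per pixel
print(S.shape, mask.shape)               # (3, 6, 6) (6, 6)
```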
9.7 Data Types

The data used with a convolutional network usually consists of several channels, each channel being the observation of a different quantity at some point in space or time. See Table 9.1 for examples of data types with different dimensionalities and number of channels.
For an example of convolutional networks applied to video, see Chen et al. (2010).
So far we have discussed only the case where every example in the train and test data has the same spatial dimensions. One advantage to convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix multiplication-based neural networks. This provides a compelling reason to use convolutional networks even when computational cost and overfitting are not significant issues.
For example, consider a collection of images, where each image has a different width and height. It is unclear how to model such inputs with a weight matrix of fixed size. Convolution is straightforward to apply; the kernel is simply applied a different number of times depending on the size of the input, and the output of the convolution operation scales accordingly. Convolution may be viewed as matrix multiplication; the same convolution kernel induces a different size of doubly block circulant matrix for each size of input. Sometimes the output of the network is allowed to have variable size as well as the input, for example if we want to assign a class label to each pixel of the input. In this case, no further design work is necessary. In other cases, the network must produce some fixed-size output, for example if we want to assign a single class label to the entire image. In this case we must make some additional design steps, like inserting a pooling layer whose pooling regions scale in size proportional to the size of the input, in order to maintain a fixed number of pooled outputs. Some examples of this kind of strategy are shown in Fig. 9.11.
Table 9.1: Examples of different formats of data that can be used with convolutional networks.

1-D, single channel. Audio waveform: the axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.

1-D, multi-channel. Skeleton animation data: animations of 3-D computer-rendered characters are generated by altering the pose of a “skeleton” over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character's skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D, single channel. Audio data that has been preprocessed with a Fourier transform: we can transform the audio waveform into a 2D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation but at a different height in the network's output.

2-D, multi-channel. Color image data: one channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions.

3-D, single channel. Volumetric data: a common source of this kind of data is medical imaging technology, such as CT scans.

3-D, multi-channel. Color video data: one axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.
Note that the use of convolution for processing variable sized inputs only makes sense for inputs that have variable size because they contain varying amounts of observation of the same kind of thing—different lengths of recordings over time, different widths of observations over space, etc. Convolution does not make sense if the input has variable size because it can optionally include different kinds of observations. For example, if we are processing college applications, and our features consist of both grades and standardized test scores, but not every applicant took the standardized test, then it does not make sense to convolve the same weights over both the features corresponding to the grades and the features corresponding to the test scores.
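A short sketch (ours) of this point: a single fixed-size kernel applies to recordings of any length, with the output size scaling accordingly; a fixed weight matrix could not do this.

```python
import numpy as np

w = np.random.randn(5)            # one shared kernel, fixed parameter count

for m in (20, 50, 37):            # recordings of different lengths
    x = np.random.randn(m)
    y = np.convolve(x, w, mode="valid")
    print(m, "->", y.shape[0])    # 20 -> 16, 50 -> 46, 37 -> 33
```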
9.8 Efficient Convolution Algorithms

Modern convolutional network applications often involve networks containing more than one million units. Powerful implementations exploiting parallel computation resources, as discussed in Sec. 12.1, are essential. However, in many cases it is also possible to speed up convolution by selecting an appropriate convolution algorithm.

Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.

When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient. It is equivalent to composing d one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space. Of course, not every convolution can be represented in this way.

Devising faster ways of performing convolution or approximate convolution without harming the accuracy of the model is an active area of research. Even techniques that improve the efficiency of only forward propagation are useful because in the commercial setting, it is typical to devote more resources to deployment of a network than to its training.
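The separable-kernel identity is easy to verify numerically. The sketch below (ours) checks that convolving with an outer-product kernel equals composing two 1-D convolutions, one per dimension.

```python
import numpy as np

def corr2d(X, K):
    """Naive 2-D 'valid' cross-correlation."""
    kr, kc = K.shape
    out = np.zeros((X.shape[0] - kr + 1, X.shape[1] - kc + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kr, j:j + kc] * K)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 32))
r = rng.normal(size=5)              # one vector per dimension
c = rng.normal(size=5)
K = np.outer(r, c)                  # separable 5x5 kernel: w^2 entries

full = corr2d(X, K)                 # one 2-D pass: O(w^2) work per output

# Composed: a 1-D pass along each axis, O(2w) work per output.
rows = np.apply_along_axis(lambda v: np.correlate(v, c, mode="valid"), 1, X)
both = np.apply_along_axis(lambda v: np.correlate(v, r, mode="valid"), 0, rows)

print(np.allclose(full, both))      # True
```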

9.9 Random or Unsupervised Features

Typically, the most expensive part of convolutional network training is learning the features. The output layer is usually relatively inexpensive due to the small number of features provided as input to this layer after passing through several layers of pooling. When performing supervised training with gradient descent, every gradient step requires a complete run of forward propagation and backward propagation through the entire network. One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion.

There are three basic strategies for obtaining convolution kernels without supervised training. One is to simply initialize them randomly. Another is to design them by hand, for example by setting each kernel to detect edges at a certain orientation or scale. Finally, one can learn the kernels with an unsupervised criterion. For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel. Part III describes many more unsupervised learning approaches. Learning the features with an unsupervised criterion allows them to be determined separately from the classifier layer at the top of the architecture. One can then extract the features for the entire training set just once, essentially constructing a new training set for the last layer. Learning the last layer is then typically a convex optimization problem, assuming the last layer is something like logistic regression or an SVM.
Random filters often work surprisingly well in conv convolutional
olutional net netwworks (Jarrett
assuming the last layer is something like logistic regression or an SVM.
et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Co Cox x and Pinto, 2011). Saxe et al.
(2011 Random
) sho
show wed filters
that often
la
layyerswconsisting
ork surprisinglyof con
conv vwell in conv
olution folloolutional
wing by net
following worksnaturally
p o oling (Jarrett
et al., 2009 ; Saxe
b ecome frequency selectiv et al., 2011 ; Pinto et
selectivee and translation in al. , 2011
inv ;
varian Co x and Pinto , 2011
ariantt when assigned random weigh ). Saxe et al.
weights.ts.
(They
2011)argue
showed that la yers consisting
that this provides an inexp of con
inexpensive v olution follo wing b y p
ensive way to choose the architecture of o oling naturally
ab ecome
conv frequencynetw
convolutional
olutional selectiv
network:ork:efirstand ev translation
aluate theinvparian
evaluate t when assigned
erformance of several random
conv weights.
convolutional
olutional
They
net
netw workargue
arc that this provides
architectures
hitectures by training an inexp
only the ensivelastwla ay
layyer,to then
choose takethethe architecture
b est of these of
a
arc conv olutional
architectures
hitectures andnetw ork:
train thefirst
en
entireevaluate
tire arc the p erformance
architecture
hitecture using a more of several
exp
expensiv
ensivconv
ensive olutional
e approac
approach. h.
network architectures by training only the last layer, then take the b est of these
An in intermediate
architectures termediate
and train approac
approach
the en h tire
is toarc
learn
hitecturethe features,
using a but moreusing metho
methods
exp ensiv ds that h.
e approac do
not require full forward and back-propagation at every gradient step. As with
An yin
multila
ultilay ertermediate
p erceptrons, approac
we use h greedy
is to learnla
lay the features,
yer-wise but using
pretraining, to trainmethothedsfirst thatlaydo
layer
er
not require full forward and back-propagation
in isolation, then extract all features from the first lay at every
layer gradient step.
er only once, then train the As with
m ultila
second la y er
lay p erceptrons, we use greedy la y er-wise
yer in isolation given those features, and so on. Chapter pretraining, to train the describ
8 has first layed
described er
in
ho
howwisolation,
to p erform thensup extract
ervisedallgreedy
supervised featureslay from the
layer-wise
er-wise first layer only
pretraining, and Ponce,
art I II then train this
extends the
second
to greedy laylaeryer-wise
lay in isolation given those
pretraining usingfeatures,
an unsup and
unsupervised
ervisedso on. Chapter
criterion at 8eac has
eachh laydescrib
layer.
er. The ed
how to p erform
canonical example supof ervised
greedygreedy
lay layer-wise
layer-wise
er-wise pretraining,
pretraining of a convandolutional
Part I II mo
convolutional extends
model del is this
the
to
con
convgreedy la y er-wise
volutional deep b elief net pretraining
netw using an unsup
work (Lee et al., 2009). Con ervised criterion
Conv at
volutional net eac h
netw lay er.
works offer The
canonical example of greedy layer-wise pretraining of a convolutional mo del is the
convolutional deep b elief network (Lee et al., 2009). Convolutional networks offer
us the opportunity to take the pretraining strategy one step further than is possible
with multilayer perceptrons. Instead of training an entire convolutional layer at a
time, we can train a model of a small patch, as Coates et al. (2011) do with k-means.
We can then use the parameters from this patch-based model to define the kernels
of a convolutional layer. This means that it is possible to use unsupervised learning
to train a convolutional network without ever using convolution during the
training process. Using this approach, we can train very large models and incur a
high computational cost only at inference time (Ranzato et al., 2007b; Jarrett et al.,
2009; Kavukcuoglu et al., 2010; Coates et al., 2013). This approach was popular
from roughly 2007–2013, when labeled datasets were small and computational
power was more limited. Today, most convolutional networks are trained in a
purely supervised fashion, using full forward and back-propagation through the
entire network on each training iteration.
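A minimal sketch of the patch-based idea follows, under the assumption that k-means centroids of normalized random patches serve directly as kernels; the actual pipeline of Coates et al. (2011) includes details (such as whitening) omitted here, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_kernels(images, n_kernels=8, patch=5, n_patches=2000, seed=0):
    """Learn convolution kernels by clustering small image patches."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        i = rng.integers(img.shape[0] - patch + 1)
        j = rng.integers(img.shape[1] - patch + 1)
        p = img[i:i + patch, j:j + patch].ravel()
        p = (p - p.mean()) / (p.std() + 1e-8)   # contrast-normalize each patch
        patches.append(p)
    km = KMeans(n_clusters=n_kernels, n_init=10, random_state=seed)
    km.fit(np.stack(patches))
    # each centroid becomes one convolution kernel
    return km.cluster_centers_.reshape(n_kernels, patch, patch)

demo_images = [np.random.default_rng(1).random((28, 28)) for _ in range(10)]
kernels = kmeans_kernels(demo_images)   # stand-in images; use real data in practice
```

The resulting kernels can then be plugged into a convolutional layer, so convolution itself is never performed during training, only at inference time.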
As with other approaches to unsupervised pretraining, it remains difficult to
tease apart the cause of some of the benefits seen with this approach. Unsupervised
pretraining may offer some regularization relative to supervised training, or it may
simply allow us to train much larger architectures due to the reduced computational
cost of the learning rule.
9.10 The Neuroscientific Basis for Convolutional Networks

Convolutional networks are perhaps the greatest success story of biologically
inspired artificial intelligence. Though convolutional networks have been guided
by many other fields, some of the key design principles of neural networks were
drawn from neuroscience.
The history of convolutional networks begins with neuroscientific experiments
long before the relevant computational models were developed. Neurophysiologists
David Hubel and Torsten Wiesel collaborated for several years to determine many
of the most basic facts about how the mammalian vision system works (Hubel and
Wiesel, 1959, 1962, 1968). Their accomplishments were eventually recognized with
a Nobel prize. Their findings that have had the greatest influence on contemporary
deep learning models were based on recording the activity of individual neurons in
cats. They observed how neurons in the cat’s brain responded to images projected
in precise locations on a screen in front of the cat. Their great discovery was
that neurons in the early visual system responded most strongly to very specific
patterns of light, such as precisely oriented bars, but responded hardly at all to
other patterns.
Their work helped to characterize many aspects of brain function that are
beyond the scope of this book. From the point of view of deep learning, we can
focus on a simplified, cartoon view of brain function.

In this simplified view, we focus on a part of the brain called V1, also known as
the primary visual cortex. V1 is the first area of the brain that begins to perform
significantly advanced processing of visual input. In this cartoon view, images are
formed by light arriving in the eye and stimulating the retina, the light-sensitive
tissue in the back of the eye. The neurons in the retina perform some simple
preprocessing of the image but do not substantially alter the way it is represented.
The image then passes through the optic nerve and a brain region called the lateral
geniculate nucleus. The main role, as far as we are concerned here, of both of these
anatomical regions is primarily just to carry the signal from the eye to V1, which
is located at the back of the head.

A convolutional network layer is designed to capture three properties of V1:
1. V1 is arranged in a spatial map. It actually has a two-dimensional structure
mirroring the structure of the image in the retina. For example, light
arriving at the lower half of the retina affects only the corresponding half of
V1. Convolutional networks capture this property by having their features
defined in terms of two-dimensional maps.

2. V1 contains many simple cells. A simple cell’s activity can to some extent be
characterized by a linear function of the image in a small, spatially localized
receptive field. The detector units of a convolutional network are designed
to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that
are similar to those detected by simple cells, but complex cells are invariant
to small shifts in the position of the feature. This inspires the pooling units
of convolutional networks. Complex cells are also invariant to some changes
in lighting that cannot be captured simply by pooling over spatial locations.
These invariances have inspired some of the cross-channel pooling strategies
in convolutional networks, such as maxout units (Goodfellow et al., 2013a).

Though we know the most about V1, it is generally believed that the same
basic principles apply to other areas of the visual system. In our cartoon view of
the visual system, the basic strategy of detection followed by pooling is repeatedly
applied as we move deeper into the brain. As we pass through multiple anatomical
layers of the brain, we eventually find cells that respond to some specific concept
and are invariant to many transformations of the input. These cells have been
nicknamed “grandmother cells”—the idea is that a person could have a neuron that
activates when seeing an image of their grandmother, regardless of whether she
appears in the left or right side of the image, whether the image is a close-up of
her face or a zoomed-out shot of her entire body, whether she is brightly lit, or in
shadow, etc.

These grandmother cells have been shown to actually exist in the human brain,
in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers
tested whether individual neurons would respond to photos of famous individuals.
They found what has come to be called the “Halle Berry neuron”: an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when a
person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing
the words “Halle Berry.” Of course, this has nothing to do with Halle Berry herself;
other neurons responded to the presence of Bill Clinton, Jennifer Aniston, etc.

These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex
(IT). When viewing an object, information flows from the retina, through the
LGN, to V1, then onward to V2, then V4, then IT. This happens within the first
100ms of glimpsing an object. If a person is allowed to continue looking at the
object for more time, then information will begin to flow backwards as the brain
uses top-down feedback to update the activations in the lower level brain areas.
However, if we interrupt the person’s gaze, and observe only the firing rates that
result from the first 100ms of mostly feedforward activation, then IT proves to be
very similar to a convolutional network. Convolutional networks can predict IT
firing rates, and also perform very similarly to (time limited) humans on object
recognition tasks (DiCarlo, 2013).

That being said, there are many differences between convolutional networks
and the mammalian vision system. Some of these differences are well known
to computational neuroscientists, but outside the scope of this book. Some of
these differences are not yet known, because many basic questions about how the
mammalian vision system works remain unanswered. As a brief list:

• The human eye is mostly very low resolution, except for a tiny patch called the
fovea. The fovea only observes an area about the size of a thumbnail held at
arm’s length. Though we feel as if we can see an entire scene in high resolution,
this is an illusion created by the subconscious part of our brain, as it stitches
together several glimpses of small areas. Most convolutional networks actually
receive large full resolution photographs as input. The human brain makes
several eye movements called saccades to glimpse the most visually salient or
task-relevant parts of a scene. Incorporating similar attention mechanisms
into deep learning models is an active research direction. In the context of
deep learning, attention mechanisms have been most successful for natural
language processing, as described in Sec. 12.4.5.1. Several visual models
with foveation mechanisms have been developed but so far have not become
the dominant approach (Larochelle and Hinton, 2010; Denil et al., 2012).

• The human visual system is integrated with many other senses, such as
hearing, and factors like our moods and thoughts. Convolutional networks
so far are purely visual.

• The human visual system does much more than just recognize objects. It is
able to understand entire scenes including many objects and relationships
between objects, and processes rich 3-D geometric information needed for
our bodies to interface with the world. Convolutional networks have been
applied to some of these problems but these applications are in their infancy.

• Even simple brain areas like V1 are heavily impacted by feedback from higher
levels. Feedback has been explored extensively in neural network models but
has not yet been shown to offer a compelling improvement.

• While feedforward IT firing rates capture much of the same information as
convolutional network features, it is not clear how similar the intermediate
computations are. The brain probably uses very different activation and
pooling functions. An individual neuron’s activation probably is not well-
characterized by a single linear filter response. A recent model of V1 involves
multiple quadratic filters for each neuron (Rust et al., 2005). Indeed our
cartoon picture of “simple cells” and “complex cells” might create a non-
existent distinction; simple cells and complex cells might both be the same
kind of cell but with their “parameters” enabling a continuum of behaviors
ranging from what we call “simple” to what we call “complex.”

It is also worth mentioning that neuroscience has told us relatively little
about how to train convolutional networks. Model structures with parameter
sharing across multiple spatial locations date back to early connectionist models
of vision (Marr and Poggio, 1976), but these models did not use the modern
back-propagation algorithm and gradient descent. For example, the Neocognitron
(Fukushima, 1980) incorporated most of the model architecture design elements of
the modern convolutional network but relied on a layer-wise unsupervised clustering
algorithm.
Lang and Hinton (1988) introduced the use of back-propagation to train time-
delay neural networks (TDNNs). To use contemporary terminology, TDNNs are
one-dimensional convolutional networks applied to time series. Back-propagation
applied to these models was not inspired by any neuroscientific observation and
is considered by some to be biologically implausible. Following the success of
back-propagation-based training of TDNNs, LeCun et al. (1989) developed the
modern convolutional network by applying the same training algorithm to 2-D
convolution applied to images.

So far we have described how simple cells are roughly linear and selective for
certain features, complex cells are more nonlinear and become invariant to some
transformations of these simple cell features, and stacks of layers that alternate
between selectivity and invariance can yield grandmother cells for very specific
phenomena. We have not yet described precisely what these individual cells detect.

In a deep, nonlinear network, it can be difficult to understand the function of
individual cells. Simple cells in the first layer are easier to analyze, because their
responses are driven by a linear function. In an artificial neural network, we can
just display an image of the convolution kernel to see what the corresponding
channel of a convolutional layer responds to. In a biological neural network, we
do not have access to the weights themselves. Instead, we put an electrode in the
neuron itself, display several samples of white noise images in front of the animal’s
retina, and record how each of these samples causes the neuron to activate. We
can then fit a linear model to these responses in order to obtain an approximation
of the neuron’s weights. This approach is known as reverse correlation (Ringach
and Shapley, 2004).
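A small simulated version of the procedure may help fix the idea. This sketch assumes a purely linear neuron and white-noise stimuli, and uses a standard least-squares fit; the synthetic “neuron” is an illustrative stand-in for a real recording.

```python
import numpy as np

def reverse_correlation(stimuli, responses):
    """Estimate a linear receptive field from white-noise stimuli.

    stimuli:   (n_samples, h, w) white-noise images shown to the neuron
    responses: (n_samples,) recorded activations
    Returns an (h, w) least-squares estimate of the weights.
    """
    n, h, w = stimuli.shape
    X = stimuli.reshape(n, h * w)
    # For white noise, X^T X is close to a scaled identity, so this is
    # approximately the spike-triggered average; lstsq handles the general case.
    weights, *_ = np.linalg.lstsq(X, responses, rcond=None)
    return weights.reshape(h, w)

# Simulated experiment: the hidden "neuron" applies a fixed linear filter.
rng = np.random.default_rng(0)
true_w = rng.standard_normal((8, 8))
noise_images = rng.standard_normal((5000, 8, 8))
r = noise_images.reshape(5000, -1) @ true_w.ravel()
estimate = reverse_correlation(noise_images, r)   # close to true_w
```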
Reverse correlation shows us that most V1 cells have weights that are described
by Gabor functions. The Gabor function describes the weight at a 2-D point in the
image. We can think of an image as being a function of 2-D coordinates, I(x, y).
Likewise, we can think of a simple cell as sampling the image at a set of locations,
defined by a set of x coordinates X and a set of y coordinates Y, and applying
weights that are also a function of the location, w(x, y). From this point of view,
the response of a simple cell to an image is given by

s(I) = Σ_{x∈X} Σ_{y∈Y} w(x, y) I(x, y).    (9.15)
Specifically, w(x, y) takes the form of a Gabor function:

w(x, y; α, βx, βy, f, φ, x0, y0, τ) = α exp(−βx x′² − βy y′²) cos(f x′ + φ),    (9.16)

where

x′ = (x − x0) cos(τ) + (y − y0) sin(τ)    (9.17)
and

y′ = −(x − x0) sin(τ) + (y − y0) cos(τ).    (9.18)

Here, α, βx, βy, f, φ, x0, y0, and τ are parameters that control the properties
of the Gabor function. Fig. 9.18 shows some examples of Gabor functions with
different settings of these parameters.
The parameters x0, y0, and τ define a coordinate system. We translate and
rotate x and y to form x′ and y′. Specifically, the simple cell will respond to image
features centered at the point (x0, y0), and it will respond to changes in brightness
as we move along a line rotated τ radians from the horizontal.

Viewed as a function of x′ and y′, the function w then responds to changes in
brightness as we move along the x′ axis. It has two important factors: one is a
Gaussian function and the other is a cosine function.

The Gaussian factor α exp(−βx x′² − βy y′²) can be seen as a gating term that
ensures the simple cell will only respond to values near where x′ and y′ are both
zero, in other words, near the center of the cell’s receptive field. The scaling factor
α adjusts the total magnitude of the simple cell’s response, while βx and βy control
how quickly its receptive field falls off.

The cosine factor cos(f x′ + φ) controls how the simple cell responds to changing
brightness along the x′ axis. The parameter f controls the frequency of the cosine
and φ controls its phase offset.

Altogether, this cartoon view of simple cells means that a simple cell responds
to a specific spatial frequency of brightness in a specific direction at a specific
location. Simple cells are most excited when the wave of brightness in the image
has the same phase as the weights. This occurs when the image is bright where the
weights are positive and dark where the weights are negative. Simple cells are most
inhibited when the wave of brightness is fully out of phase with the weights—when
the image is dark where the weights are positive and bright where the weights are
negative.

The cartoon view of a complex cell is that it computes the L² norm of the
2-D vector containing two simple cells’ responses: c(I) = √(s0(I)² + s1(I)²). An
important special case occurs when s1 has all of the same parameters as s0 except
for φ, and φ is set such that s1 is one quarter cycle out of phase with s0. In
this case, s0 and s1 form a quadrature pair. A complex cell defined in this way
responds when the Gaussian reweighted image I(x, y) exp(−βx x′² − βy y′²) contains
a high amplitude sinusoidal wave with frequency f in direction τ near (x0, y0),
regardless of the phase offset of this wave. In other words, the complex cell
is invariant to small translations of the image in direction τ, or to negating the
image (replacing black with white and vice versa).
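This invariance can be checked directly in a small, self-contained continuation of the earlier sketch: a quadrature pair of simple cells yields a complex cell whose response stays nearly constant as a grating’s phase shifts. The parameter values remain arbitrary.

```python
import numpy as np

def gabor(x, y, alpha, beta_x, beta_y, f, phi, x0, y0, tau):
    # Same Gabor weight function as in the previous sketch (Eqs. 9.16-9.18).
    xp = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)
    yp = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)
    return alpha * np.exp(-beta_x * xp**2 - beta_y * yp**2) * np.cos(f * xp + phi)

def simple_cell(I, w):
    return np.sum(w * I)                       # Eq. 9.15

def complex_cell(I, w0, w1):
    # L2 norm of a quadrature pair of simple cell responses
    return np.hypot(simple_cell(I, w0), simple_cell(I, w1))

ys, xs = np.mgrid[-7:8, -7:8]
w0 = gabor(xs, ys, 1.0, 0.1, 0.1, 1.0, 0.0,       0.0, 0.0, 0.0)
w1 = gabor(xs, ys, 1.0, 0.1, 0.1, 1.0, np.pi / 2, 0.0, 0.0, 0.0)  # quarter cycle offset

# Shifting the grating's phase changes each simple cell's response but
# leaves the complex cell's response nearly constant.
for shift in (0.0, 0.5, 1.0):
    grating = np.cos(xs + shift)
    print(round(simple_cell(grating, w0), 3),
          round(complex_cell(grating, w0, w1), 3))
```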
Figure 9.18: Gabor functions with a variety of parameter settings. White indicates
large positive weight, black indicates large negative weight, and the background gray
corresponds to zero weight. (Left) Gabor functions with different values of the parameters
that control the coordinate system: x0, y0, and τ. Each Gabor function in this grid is
assigned a value of x0 and y0 proportional to its position in its grid, and τ is chosen so
that each Gabor filter is sensitive to the direction radiating out from the center of the grid.
For the other two plots, x0, y0, and τ are fixed to zero. (Center) Gabor functions with
different Gaussian scale parameters βx and βy. Gabor functions are arranged in increasing
width (decreasing βx) as we move left to right through the grid, and increasing height
(decreasing βy) as we move top to bottom. For the other two plots, the β values are fixed
to 1.5× the image width. (Right) Gabor functions with different sinusoid parameters f
and φ. As we move top to bottom, f increases, and as we move left to right, φ increases.
For the other two plots, φ is fixed to 0 and f is fixed to 5× the image width.

Some of the most striking correspondences between neuroscience and machine
learning come from visually comparing the features learned by machine learning
models with those employed by V1. Olshausen and Field (1996) showed that
a simple unsupervised learning algorithm, sparse coding, learns features with
receptive fields similar to those of simple cells. Since then, we have found that
an extremely wide variety of statistical learning algorithms learn features with
Gabor-like functions when applied to natural images. This includes most deep
learning algorithms, which learn these features in their first layer. Fig. 9.19 shows
some examples. Because so many different learning algorithms learn edge detectors,
it is difficult to conclude that any specific learning algorithm is the “right” model
of the brain just based on the features that it learns (though it can certainly be a
bad sign if an algorithm does not learn some sort of edge detector when applied to
natural images). These features are an important part of the statistical structure
of natural images and can be recovered by many different approaches to statistical
modeling. See Hyvärinen et al. (2009) for a review of the field of natural image
statistics.
Figure 9.19: Many machine learning algorithms learn features that detect edges or specific
colors of edges when applied to natural images. These feature detectors are reminiscent of
the Gabor functions known to be present in primary visual cortex. (Left) Weights learned
by an unsupervised learning algorithm (spike and slab sparse coding) applied to small
image patches. (Right) Convolution kernels learned by the first layer of a fully supervised
convolutional maxout network. Neighboring pairs of filters drive the same maxout unit.
9.11 Convolutional Networks and the History of Deep Learning

Convolutional networks have played an important role in the history of deep
learning. They are a key example of a successful application of insights obtained
by studying the brain to machine learning applications. They were also some of
the first deep models to perform well, long before arbitrary deep models were
considered viable. Convolutional networks were also some of the first neural
networks to solve important commercial applications and remain at the forefront
of commercial applications of deep learning today. For example, in the 1990s, the
neural network research group at AT&T developed a convolutional network for
reading checks (LeCun et al., 1998b). By the end of the 1990s, this system deployed
by NEC was reading over 10% of all the checks in the US. Later, several OCR
and handwriting recognition systems based on convolutional nets were deployed
by Microsoft (Simard et al., 2003). See Chapter 12 for more details on such
applications and more modern applications of convolutional networks. See LeCun
et al. (2010) for a more in-depth history of convolutional networks up to 2010.

Convolutional networks were also used to win many contests. The current
intensity of commercial interest in deep learning began when Krizhevsky et al.
(2012) won the ImageNet object recognition challenge, but convolutional networks
had been used to win other machine learning and computer vision contests with
less impact for years earlier.

Convolutional nets were some of the first working deep networks trained with
back-propagation. It is not entirely clear why convolutional networks succeeded
when general back-propagation networks were considered to have failed. It may
simply be that convolutional networks were more computationally efficient than
fully connected networks, so it was easier to run multiple experiments with them
and tune their implementation and hyperparameters. Larger networks also seem
to be easier to train. With modern hardware, large fully connected networks
appear to perform reasonably on many tasks, even when using datasets that were
available and activation functions that were popular during the times when fully
connected networks were believed not to work well. It may be that the primary
barriers to the success of neural networks were psychological (practitioners did
not expect neural networks to work, so they did not make a serious effort to use
neural networks). Whatever the case, it is fortunate that convolutional networks
performed well decades ago. In many ways, they carried the torch for the rest of
deep learning and paved the way to the acceptance of neural networks in general.

Convolutional networks provide a way to specialize neural networks to work
with data that has a clear grid-structured topology and to scale such models to
very large size. This approach has been the most successful on a two-dimensional
image topology. To process one-dimensional, sequential data, we turn next to
another powerful specialization of the neural networks framework: recurrent neural
networks.
Chapter 10

Sequence Modeling: Recurrent and Recursive Nets

Recurrent neural networks or RNNs (Rumelhart et al., 1986a) are a family of
neural networks for processing sequential data. Much as a convolutional network
is a neural network that is specialized for processing a grid of values X such as
an image, a recurrent neural network is a neural network that is specialized for
processing a sequence of values x^(1), . . . , x^(τ). Just as convolutional networks
can readily scale to images with large width and height, and some convolutional
networks can process images of variable size, recurrent networks can scale to much
longer sequences than would be practical for networks without sequence-based
specialization. Most recurrent networks can also process sequences of variable
length.

To go from multi-layer networks to recurrent networks, we need to take advan-
tage of one of the early ideas found in machine learning and statistical models of
the 1980s: sharing parameters across different parts of a model. Parameter sharing
makes it possible to extend and apply the model to examples of different forms
(different lengths, here) and generalize across them. If we had separate parameters
for each value of the time index, we could not generalize to sequence lengths not
seen during training, nor share statistical strength across different sequence lengths
and across different positions in time. Such sharing is particularly important when
a specific piece of information can occur at multiple positions within the sequence.
For example, consider the two sentences “I went to Nepal in 2009” and “In 2009,
I went to Nepal.” If we ask a machine learning model to read each sentence and
extract the year in which the narrator went to Nepal, we would like it to recognize
the year 2009 as the relevant piece of information, whether it appears in the sixth
word or the second word of the sen sentence.


tence. SuppSuppose ose that we trained a feedforward
net
network
work that pro processes
cesses sen sentences
tences of fixed length. A traditional fully connected
w ord or the
feedforw
feedforward ard second
net workword
network would of hav
theesen
have tence. Supp
separate ose that
parameters forwe trained
each inputa feature,
feedforwardso it
net work that pro cesses sen tences of fixed length. A
would need to learn all of the rules of the language separately at each position in traditional fully connected
feedforw
the sen ard net
sentence.
tence. Bywork would hav
comparison, aerecurrent
separate neural
parameters netw
networkfor shares
ork each input feature,
the same weighso ts
weights it
w ould need
across several to time
learnsteps.
all of the rules of the language separately at each position in
the sentence. By comparison, a recurrent neural network shares the same weights
A related idea is the use of convolution across a 1-D temporal sequence. This convolutional approach is the basis for time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). The convolution operation allows a network to share parameters across time, but is shallow. The output of convolution is a sequence where each member of the output is a function of a small number of neighboring members of the input. The idea of parameter sharing manifests in the application of the same convolution kernel at each time step. Recurrent networks share parameters in a different way. Each member of the output is a function of the previous members of the output. Each member of the output is produced using the same update rule applied to the previous outputs. This recurrent formulation results in the sharing of parameters through a very deep computational graph.

For the simplicity of exposition, we refer to RNNs as operating on a sequence that contains vectors x^{(t)} with the time step index t ranging from 1 to τ. In practice, recurrent networks usually operate on minibatches of such sequences, with a different sequence length τ for each member of the minibatch. We have omitted the minibatch indices to simplify notation. Moreover, the time step index need not literally refer to the passage of time in the real world, but only to the position in the sequence. RNNs may also be applied in two dimensions across spatial data such as images, and even when applied to data involving time, the network may have connections that go backwards in time, provided that the entire sequence is observed before it is provided to the network.
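To make the contrast between the two sharing schemes concrete, here is a rough Python sketch (my illustration, not code from the book; the kernel w and decay parameter theta are made up): convolution reuses one kernel across time but stays shallow, while recurrence reuses one update rule whose applications compose into a deep graph.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=10)          # a 1-D temporal sequence
    w = np.array([0.2, 0.5, 0.3])    # one convolution kernel, reused at every t

    # Convolution: each output depends on a small neighborhood of the input,
    # so the graph is shallow even though the kernel is shared across time.
    conv_out = np.convolve(x, w, mode="valid")

    # Recurrence: each output depends on the previous output, so the shared
    # update rule composes with itself into a very deep computational graph.
    theta = 0.9
    h, rec_out = 0.0, []
    for t in range(len(x)):
        h = np.tanh(theta * h + x[t])
        rec_out.append(h)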
This chapter extends the idea of a computational graph to include cycles. These cycles represent the influence of the present value of a variable on its own value at a future time step. Such computational graphs allow us to define recurrent neural networks. We then describe many different ways to construct, train, and use recurrent neural networks.

For more information on recurrent neural networks than is available in this chapter, we refer the reader to the textbook of Graves (2012).

10.1 Unfolding Computational Graphs

A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss. Please refer to Sec. 6.5.1 for a general introduction. In this section we explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.

For example, consider the classical form of a dynamical system:

    s^{(t)} = f(s^{(t−1)}; θ),        (10.1)

where s^{(t)} is called the state of the system.
Eq. 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t − 1.

For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times. For example, if we unfold Eq. 10.1 for τ = 3 time steps, we obtain
    s^{(3)} = f(s^{(2)}; θ)        (10.2)
            = f(f(s^{(1)}; θ); θ)        (10.3)

Unfolding the equation by repeatedly applying the definition in this way has yielded an expression that does not involve recurrence. Such an expression can now be represented by a traditional directed acyclic computational graph. The unfolded computational graph of Eq. 10.1 and Eq. 10.3 is illustrated in Fig. 10.1.
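As a minimal sketch of this unfolding (my illustration; the particular choice of f, θ and s^{(1)} is made up), the composed expression of Eq. 10.3 and a loop that applies the definition τ − 1 times compute the same value:

    import numpy as np

    # Hypothetical transition function f(s; theta); any fixed choice works here.
    def f(s, theta):
        return np.tanh(theta * s)

    theta = 1.3   # shared parameters, reused at every step
    s1 = 0.5      # an arbitrary initial state s^(1)

    # Eq. 10.3: the explicitly unfolded expression for tau = 3
    s3_composed = f(f(s1, theta), theta)

    # The same computation as a loop applying the definition tau - 1 = 2 times
    s = s1
    for _ in range(2):
        s = f(s, theta)

    assert np.isclose(s, s3_composed)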
Figure 10.1: The classical dynamical system described by Eq. 10.1, illustrated as an unfolded computational graph. Each node represents the state at some time t and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps.

As another example, let us consider a dynamical system driven by an external signal x^{(t)},

    s^{(t)} = f(s^{(t−1)}, x^{(t)}; θ),        (10.4)
where we see that the state now contains information about the whole past sequence.

Recurrent neural networks can be built in many different ways. Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.

Many recurrent neural networks use Eq. 10.5 or a similar equation to define the values of their hidden units. To indicate that the state is the hidden units of the network, we now rewrite Eq. 10.4 using the variable h to represent the state:

    h^{(t)} = f(h^{(t−1)}, x^{(t)}; θ),        (10.5)

As illustrated in Fig. 10.2, typical RNNs will add extra architectural features such as output layers that read information out of the state h to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h^{(t)} as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t. This summary is in general necessarily lossy, since it maps an arbitrary length sequence (x^{(t)}, x^{(t−1)}, x^{(t−2)}, ..., x^{(2)}, x^{(1)}) to a fixed length vector h^{(t)}. Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than other aspects. For example, if the RNN is used in statistical language modeling, typically to predict the next word given previous words, it may not be necessary to store all of the information in the input sequence up to time t, but rather only enough information to predict the rest of the sentence. The most demanding situation is when we ask h^{(t)} to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks (Chapter 14).
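A minimal sketch of Eq. 10.5 in NumPy (my illustration; the tanh update and the made-up matrices W and U are one possible choice of f) shows how inputs of any length are compressed into the same fixed-size, and therefore lossy, state vector:

    import numpy as np

    rng = np.random.default_rng(0)
    n_x, n_h = 3, 5                       # made-up sizes
    W = rng.normal(scale=0.3, size=(n_h, n_h))
    U = rng.normal(scale=0.3, size=(n_h, n_x))

    def summarize(x_seq):
        # Apply Eq. 10.5 repeatedly: h^(t) = f(h^(t-1), x^(t); theta).
        h = np.zeros(n_h)                 # initial state h^(0)
        for x in x_seq:
            h = np.tanh(W @ h + U @ x)    # same parameters at every step
        return h                          # fixed-size summary of the sequence

    h_short = summarize(rng.normal(size=(4, n_x)))
    h_long = summarize(rng.normal(size=(40, n_x)))

The summary is lossy by construction: sequences of length 4 and 40 are both compressed into the same number of dimensions.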

Figure 10.2: A recurrent netw


network
ork with no outputs. This recurrent netw
network
ork just pro
processes
cesses
information from the input x by incorp
incorporating
orating it into the state h that is passed forward
Figure 10.2:
through time.A recurrent netwdiagram.
Circuit ork with The
no outputs. This recurrent
black square indicates netw ork of
a delay just processes
1 time step.
information from the
The same netw input
network x b y incorp orating it into the state h that is passed
ork seen as an unfolded computational graph, where each no forward
node
de is
through
no
noww assotime.
associated Circuit
ciated with one diagram.
particular The
time black square indicates a delay of 1 time step.
instance.
The same network seen as an unfolded computational graph, where each node is
nowEq.
asso10.5
ciatedcan
withbeone particular
drawn in twtime
drawn instance.
o different ways. One way to dra draw w the RNN is
with a diagram con
containing
taining one no
node
de for ev
every
ery component that migh
mightt exist in a
Eq. 10.5 can be drawn in two different ways. One way to draw the RNN is
377every component that might exist in a
with a diagram containing one node for
physical implementation of the model, such as a biological neural network. In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state, as in the left of Fig. 10.2. Throughout this chapter, we use a black square in a circuit diagram to indicate that an interaction takes place with a delay of 1 time step, from the state at time t to the state at time t + 1. The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time. Each variable for each time step is drawn as a separate node of the computational graph, as in the right of Fig. 10.2. What we call unfolding is the operation that maps a circuit as in the left side of the figure to a computational graph with repeated pieces as in the right side. The unfolded graph now has a size that depends on the sequence length.

We can represent the unfolded recurrence after t steps with a function g^{(t)}:

    h^{(t)} = g^{(t)}(x^{(t)}, x^{(t−1)}, x^{(t−2)}, ..., x^{(2)}, x^{(1)})        (10.6)
            = f(h^{(t−1)}, x^{(t)}; θ)        (10.7)

The function g^{(t)} takes the whole past sequence (x^{(t)}, x^{(t−1)}, x^{(t−2)}, ..., x^{(2)}, x^{(1)}) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^{(t)} into repeated application of a function f. The unfolding process thus introduces two major advantages:

1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of transition from one state to another state, rather than specified in terms of a variable-length history of states.

2. It is possible to use the transition function f with the same parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^{(t)} for all possible time steps. Learning a single, shared model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform. The unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.
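To see the first advantage concretely, here is a toy sketch (my illustration, with made-up sizes): a directly parametrized g^{(t)} would need an input that grows with t, while the shared transition function f always sees a fixed-size pair (h^{(t−1)}, x^{(t)}):

    # Input sizes the two parametrizations must handle at step t,
    # for made-up state size n_h and input size n_x.
    n_x, n_h = 3, 5

    def g_input_size(t):
        # A separate model g^(t) consumes the whole history x^(1), ..., x^(t).
        return t * n_x               # grows with the sequence length

    def f_input_size(t):
        # The shared transition f consumes (h^(t-1), x^(t)), regardless of t.
        return n_h + n_x             # constant for all time steps

    assert [g_input_size(t) for t in (1, 10, 100)] == [3, 30, 300]
    assert len({f_input_size(t) for t in (1, 10, 100)}) == 1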
10.2 Recurrent Neural Networks

Armed with the graph unrolling and parameter sharing ideas of Sec. 10.1, we can design a wide variety of recurrent neural networks.

Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input to hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Eq. 10.8 defines forward propagation in this model. (Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded computational graph, where each node is now associated with one particular time instance.

Some examples of important design patterns for recurrent neural networks include the following:

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in Fig. 10.3.

• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in Fig. 10.4.

• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in Fig. 10.5.

Fig. 10.3 is a reasonably representative example that we return to throughout most of the chapter.

The recurrent neural network of Fig. 10.3 and Eq. 10.8 is universal in the sense that any function computable by a Turing machine can be computed by such a recurrent network of a finite size. The output can be read from the RNN after a number of time steps that is asymptotically linear in the number of time steps used by the Turing machine and asymptotically linear in the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results regard exact implementation of the function, not approximations. The RNN, when used as a Turing machine, takes a binary sequence as input and its outputs must be discretized to provide a binary output. It is possible to compute all functions in this setting using a single specific RNN of finite size (Siegelmann and Sontag (1995) use 886 units). The “input” of the Turing machine is a specification of the function to be computed, so the same network that simulates this Turing machine is sufficient for all problems. The theoretical RNN used for the proof can simulate an unbounded stack by representing its activations and weights with rational numbers of unbounded precision.

We now develop the forward propagation equations for the RNN depicted in Fig. 10.3. The figure does not specify the choice of activation function for the hidden units. Here we assume the hyperbolic tangent activation function. Also, the figure does not specify exactly what form the output and loss function take. Here we assume that the output is discrete, as if the RNN is used to predict words or characters. A natural way to represent discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible value of the discrete variable. We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h^{(0)}. Then, for each time step from t = 1 to t = τ, we apply the following update equations:

    a^{(t)} = b + W h^{(t−1)} + U x^{(t)}        (10.8)
Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x^{(t)}, the hidden layer activations are h^{(t)}, the outputs are o^{(t)}, the targets are y^{(t)} and the loss is L^{(t)}. (Left) Circuit diagram. (Right) Unfolded computational graph. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by Fig. 10.3. The RNN in Fig. 10.3 can choose to put any information it wants about the past into its hidden representation h and transmit h to the future. The RNN in this figure is trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward. The previous h is connected to the present only indirectly, via the predictions it was used to produce. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others, allowing greater parallelization during training, as described in Sec. 10.2.1.
    h^{(t)} = tanh(a^{(t)})        (10.9)
    o^{(t)} = c + V h^{(t)}        (10.10)
    ŷ^{(t)} = softmax(o^{(t)})        (10.11)

where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L^{(t)} is the negative log-likelihood of y^{(t)} given x^{(1)}, ..., x^{(t)}, then

    L({x^{(1)}, ..., x^{(τ)}}, {y^{(1)}, ..., y^{(τ)}})        (10.12)
    = Σ_t L^{(t)}        (10.13)
    = − Σ_t log p_model( y^{(t)} | {x^{(1)}, ..., x^{(t)}} ),        (10.14)

where p_model( y^{(t)} | {x^{(1)}, ..., x^{(t)}} ) is given by reading the entry for y^{(t)} from the model’s output vector ŷ^{(t)}. Computing the gradient of this loss function with respect to the parameters is an expensive operation. The gradient computation involves performing a forward propagation pass moving left to right through our illustration of the unrolled graph in Fig. 10.3, followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one. States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time or BPTT and is discussed further in Sec. 10.2.2. The network with recurrence between hidden units is thus very powerful but also expensive to train. Is there an alternative?
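The following NumPy sketch (my illustration, with made-up sizes and random data, not the book's code) implements one pass of Eqs. 10.8-10.11 and accumulates the loss of Eq. 10.14:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_h, n_out, tau = 4, 8, 3, 5
    U = rng.normal(scale=0.1, size=(n_h, n_in))    # input-to-hidden
    W = rng.normal(scale=0.1, size=(n_h, n_h))     # hidden-to-hidden
    V = rng.normal(scale=0.1, size=(n_out, n_h))   # hidden-to-output
    b, c = np.zeros(n_h), np.zeros(n_out)

    x = rng.normal(size=(tau, n_in))               # inputs x^(1..tau)
    y = rng.integers(n_out, size=tau)              # integer targets y^(1..tau)

    h = np.zeros(n_h)                              # initial state h^(0)
    total_loss = 0.0
    for t in range(tau):
        a = b + W @ h + U @ x[t]                   # Eq. 10.8
        h = np.tanh(a)                             # Eq. 10.9
        o = c + V @ h                              # Eq. 10.10
        e = np.exp(o - o.max())                    # Eq. 10.11 (stable softmax)
        y_hat = e / e.sum()
        total_loss += -np.log(y_hat[y[t]])         # adds L^(t), as in Eq. 10.14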

The network with recurrent connections only from the output at one time step to the hidden units at the next time step (shown in Fig. 10.4) is strictly less powerful because it lacks hidden-to-hidden recurrent connections. For example, it cannot simulate a universal Turing machine. Because this network lacks hidden-to-hidden recurrence, it requires that the output units capture all of the information about the past that the network will use to predict the future. Because the output units are explicitly trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input, unless the user knows how to describe the full state of the system and provides it as part of the training set targets. The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled. Training can thus be parallelized, with the gradient for each step t computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.

Figure 10.5: Time-unfolded recurrent neural network with a single output at the end of the sequence. Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end (as depicted here) or the gradient on the output o can be obtained by back-propagating from further downstream modules.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing. Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y^{(t)} as input at time t + 1. We can see this by examining a sequence with two time steps. The conditional maximum likelihood criterion is

    log p( y^{(1)}, y^{(2)} | x^{(1)}, x^{(2)} )        (10.15)

Figure 10.6: Illustration of teacher forcing. Teacher forcing is a training technique that is applicable to RNNs that have connections from their output to their hidden states at the next time step. (Left) At train time, we feed the correct output y^{(t)} drawn from the train set as input to h^{(t+1)}. (Right) When the model is deployed, the true output is generally not known. In this case, we approximate the correct output y^{(t)} with the model’s output o^{(t)}, and feed the output back into the model.
    = log p( y^{(2)} | y^{(1)}, x^{(1)}, x^{(2)} ) + log p( y^{(1)} | x^{(1)}, x^{(2)} )        (10.16)

In this example, we see that at time t = 2, the model is trained to maximize the conditional probability of y^{(2)} given the x sequence so far and the previous y value from the training set. Maximum likelihood thus specifies that during training, rather than feeding the model’s own output back into itself, these connections should be fed with the target values specifying what the correct output should be. This is illustrated in Fig. 10.6.

We originally motivated teacher forcing as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections. Teacher forcing may still be applied to models that have hidden-to-hidden connections so long as they have connections from the output at one time step to values computed in the next time step. However, as soon as the hidden units become a function of earlier time steps, the BPTT algorithm is necessary. Some models may thus be trained with both teacher forcing and BPTT.

The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs (or samples from the output distribution) fed back as input. In this case, the kind of inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time. One way to mitigate this problem is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths. In this way, the network can learn to take into account input conditions (such as those it generates itself in the free-running mode) not seen during training and how to map the state back towards one that will make the network generate proper outputs after a few steps. Another approach (Bengio et al., 2015b) to mitigate the gap between the inputs seen at train time and the inputs seen at test time randomly chooses to use generated values or actual data values as input. This approach exploits a curriculum learning strategy to gradually use more of the generated values as input.
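A small sketch of the two regimes (my illustration; the step function and its parameters are made up) makes the train/test mismatch visible: under teacher forcing each step conditions on training-set outputs, while in the open-loop mode the network conditions on its own generated outputs:

    import numpy as np

    rng = np.random.default_rng(1)
    n_h, n_out, tau = 6, 4, 5
    W_in = rng.normal(scale=0.2, size=(n_h, n_out))   # previous output to hidden
    V = rng.normal(scale=0.2, size=(n_out, n_h))      # hidden to output

    def step(y_prev):
        return V @ np.tanh(W_in @ y_prev)

    y_train = rng.normal(size=(tau, n_out))           # ground-truth outputs

    # Train time (teacher forcing): condition on the correct previous output,
    # so each time step depends only on training data, not on other steps.
    outputs_tf = [step(y_train[t - 1]) for t in range(1, tau)]

    # Test time (open loop): feed the model's own previous output back in; the
    # inputs now come from the model itself, a distribution unseen in training.
    y_prev, outputs_ol = y_train[0], []
    for t in range(1, tau):
        o = step(y_prev)
        outputs_ol.append(o)
        y_prev = o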

Computing the gradient through a recurrent neural network is straightforward. One simply applies the generalized back-propagation algorithm of Sec. 6.5.6 to the unrolled computational graph. No specialized algorithms are necessary. The use of back-propagation on the unrolled graph is called the back-propagation through time (BPTT) algorithm. Gradients obtained by back-propagation may then be used with any general-purpose gradient-based techniques to train an RNN.
To gain some intuition for how the BPTT algorithm behaves, we provide an example of how to compute gradients by BPTT for the RNN equations above (Eq. 10.8 and Eq. 10.12). The nodes of our computational graph include the parameters U, V, W, b and c as well as the sequence of nodes indexed by t for x^{(t)}, h^{(t)}, o^{(t)} and L^{(t)}. For each node N we need to compute the gradient ∇_N L recursively, based on the gradient computed at nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss

    ∂L / ∂L^{(t)} = 1.        (10.17)
In this derivation we assume that the outputs o^{(t)} are used as the argument to the softmax function to obtain the vector ŷ of probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target y^{(t)} given the input so far. The gradient ∇_{o^{(t)}} L on the outputs at time step t, for all i, t, is as follows:

    (∇_{o^{(t)}} L)_i = ∂L / ∂o_i^{(t)} = (∂L / ∂L^{(t)}) (∂L^{(t)} / ∂o_i^{(t)}) = ŷ_i^{(t)} − 1_{i,y^{(t)}}.        (10.18)

We work our way backwards, starting from the end of the sequence. At the final time step τ, h^{(τ)} only has o^{(τ)} as a descendent, so its gradient is simple:

    ∇_{h^{(τ)}} L = (∂o^{(τ)} / ∂h^{(τ)})^⊤ ∇_{o^{(τ)}} L = V^⊤ ∇_{o^{(τ)}} L.        (10.19)

We can then iterate backwards in time to back-propagate gradients through time, from t = τ − 1 down to t = 1, noting that h^{(t)} (for t < τ) has as descendents both o^{(t)} and h^{(t+1)}. Its gradient is thus given by

    ∇_{h^{(t)}} L = (∂h^{(t+1)} / ∂h^{(t)})^⊤ (∇_{h^{(t+1)}} L) + (∂o^{(t)} / ∂h^{(t)})^⊤ (∇_{o^{(t)}} L)        (10.20)
                  = W^⊤ diag(1 − (h^{(t+1)})²) (∇_{h^{(t+1)}} L) + V^⊤ (∇_{o^{(t)}} L)        (10.21)

where diag(1 − (h^{(t+1)})²) indicates the diagonal matrix containing the elements 1 − (h_i^{(t+1)})². This is the Jacobian of the hyperbolic tangent associated with the hidden unit i at time t + 1.
Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes, which have descendents at all the time steps:

    ∇_c L = Σ_t (∂o^{(t)} / ∂c)^⊤ ∇_{o^{(t)}} L = Σ_t ∇_{o^{(t)}} L
    ∇_b L = Σ_t (∂h^{(t)} / ∂b)^⊤ ∇_{h^{(t)}} L = Σ_t diag(1 − (h^{(t)})²) ∇_{h^{(t)}} L
    ∇_V L = Σ_t (∇_{o^{(t)}} L) h^{(t)⊤}
    ∇_W L = Σ_t diag(1 − (h^{(t)})²) (∇_{h^{(t)}} L) h^{(t−1)⊤}
    ∇_U L = Σ_t diag(1 − (h^{(t)})²) (∇_{h^{(t)}} L) x^{(t)⊤}

We do not need to compute the gradient with respect to x^{(t)} for training because it does not have any parameters as ancestors in the computational graph defining the loss.
We are abusing notation somewhat in the above equations. We correctly use ∇_{h^{(t)}} L to indicate the full influence of h^{(t)} through all paths from h^{(t)} to L. This is in contrast to our usage of Jacobians such as ∂h^{(t)} / ∂W, which we use here in an unconventional manner. By ∂h^{(t)} / ∂W we refer to the effect of W on h^{(t)} only via the use of W at time step t. This is not standard calculus notation, because the standard definition of the Jacobian would actually include the complete influence of W on h^{(t)} via its use in all of the preceding time steps to produce h^{(t−1)}. What we refer to here is in fact the method of Sec. 6.5.6, that computes the contribution of a single edge in the computational graph to the gradient.
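To connect these equations to an implementation, here is a minimal NumPy sketch of BPTT for this RNN (my illustration, with made-up sizes and random data). The forward pass stores the states h^{(t)}; the backward pass then applies Eqs. 10.18-10.21 and accumulates the parameter gradients above:

    import numpy as np

    rng = np.random.default_rng(2)
    n_in, n_h, n_out, tau = 3, 5, 4, 6
    U = rng.normal(scale=0.3, size=(n_h, n_in))
    W = rng.normal(scale=0.3, size=(n_h, n_h))
    V = rng.normal(scale=0.3, size=(n_out, n_h))
    b, c = np.zeros(n_h), np.zeros(n_out)
    x = rng.normal(size=(tau, n_in))
    y = rng.integers(n_out, size=tau)

    # Forward pass, storing states for reuse in the backward pass (O(tau) memory).
    h = np.zeros((tau + 1, n_h))                     # h[0] is h^(0)
    y_hat = np.zeros((tau, n_out))
    for t in range(tau):
        h[t + 1] = np.tanh(b + W @ h[t] + U @ x[t])  # Eqs. 10.8-10.9
        o = c + V @ h[t + 1]                         # Eq. 10.10
        e = np.exp(o - o.max())
        y_hat[t] = e / e.sum()                       # Eq. 10.11

    # Backward pass, from t = tau down to t = 1.
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros(n_h)                          # no gradient beyond t = tau
    for t in reversed(range(tau)):
        do = y_hat[t].copy()
        do[y[t]] -= 1.0                              # Eq. 10.18
        dh = V.T @ do + dh_next                      # Eqs. 10.19-10.21
        da = (1.0 - h[t + 1] ** 2) * dh              # tanh Jacobian diag(1 - h^2)
        dc += do                                     # gradients on parameters,
        db += da                                     # summed over time steps
        dV += np.outer(do, h[t + 1])
        dW += np.outer(da, h[t])
        dU += np.outer(da, x[t])
        dh_next = W.T @ da                           # flows back to step t - 1

Gradient-checking a sketch like this against finite differences is a useful sanity test, since the recurrence makes indexing errors easy to introduce.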

In the example recurrent network we have developed so far, the losses L^{(t)} were cross-entropies between training targets y^{(t)} and outputs o^{(t)}. As with a feedforward network, it is in principle possible to use almost any loss with a recurrent network. The loss should be chosen based on the task. As with a feedforward network, we usually wish to interpret the output of the RNN as a probability distribution, and we usually use the cross-entropy associated with that distribution to define the loss. Mean squared error is the cross-entropy loss associated with an output distribution that is a unit Gaussian, for example, just as with a feedforward network.

When we use a predictive log-likelihood training objective, such as Eq. 10.12, we train the RNN to estimate the conditional distribution of the next sequence element y^{(t)} given the past inputs. This may mean that we maximize the log-likelihood

    log p( y^{(t)} | x^{(1)}, ..., x^{(t)} ),        (10.22)
or, if the model includes connections from the output at one time step to the next time step,

    log p( y^{(t)} | x^{(1)}, ..., x^{(t)}, y^{(1)}, ..., y^{(t−1)} ).        (10.23)

Decomposing the joint probability over the sequence of y values as a series of one-step probabilistic predictions is one way to capture the full joint distribution across the whole sequence. When we do not feed past y values as inputs that condition the next step prediction, the directed graphical model contains no edges from any y^{(i)} in the past to the current y^{(t)}. In this case, the outputs y are conditionally independent given the sequence of x values. When we do feed the actual y values (not their prediction, but the actual observed or generated values) back into the network, the directed graphical model contains edges from all y^{(i)} values in the past to the current y^{(t)} value.
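A tiny sketch of this one-step decomposition (my illustration; cond_logp is a hypothetical stand-in for whatever conditional the RNN computes) sums the one-step terms into the joint log-probability:

    import numpy as np

    # Hypothetical stand-in for the RNN's conditional log P(y^(t) | y^(1..t-1)).
    def cond_logp(history, y_t):
        p = np.array([0.5, 0.3, 0.2])      # toy distribution over 3 symbols
        return np.log(p[y_t])

    def sequence_log_prob(y_seq):
        # Chain rule: log P(y^(1), ..., y^(tau)) as a sum of one-step terms.
        return sum(cond_logp(y_seq[:t], y_seq[t]) for t in range(len(y_seq)))

    logp = sequence_log_prob([0, 2, 1, 0])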

Figure 10.7: Fully connected graphical model for a sequence y^{(1)}, y^{(2)}, ..., y^{(t)}, ...: every past observation y^{(i)} may influence the conditional distribution of some y^{(t)} (for t > i), given the previous values. Parametrizing the graphical model directly according to this graph (as in Eq. 10.6) might be very inefficient, with an ever growing number of inputs and parameters for each element of the sequence. RNNs obtain the same full connectivity but efficient parametrization, as illustrated in Fig. 10.8.

As a simple example, let us consider the case where the RNN models only a sequence of scalar random variables Y = {y^{(1)}, ..., y^{(τ)}}, with no additional inputs x. The input at time step t is simply the output at time step t − 1. The RNN then defines a directed graphical model over the y variables. We parametrize the joint distribution of these observations using the chain rule (Eq. 3.6) for conditional probabilities:

    P(Y) = P( y^{(1)}, ..., y^{(τ)} ) = ∏_{t=1}^{τ} P( y^{(t)} | y^{(t−1)}, y^{(t−2)}, ..., y^{(1)} )        (10.24)
where the right-hand side of the bar is empty for t = 1, of course. Hence the negative log-likelihood of a set of values {y^{(1)}, ..., y^{(τ)}} according to such a model is

    L = Σ_t L^{(t)}        (10.25)

where

    L^{(t)} = − log P( y^{(t)} = y^{(t)} | y^{(t−1)}, y^{(t−2)}, ..., y^{(1)} ).        (10.26)

Figure 10.8: Introducing the state variable in the graphical model of the RNN, even though it is a deterministic function of its inputs, helps to see how we can obtain a very efficient parametrization, based on Eq. 10.5. Every stage in the sequence (for h^{(t)} and y^{(t)}) involves the same structure (the same number of inputs for each node) and can share the same parameters with the other stages.

The edges in a graphical model indicate which variables depend directly on other variables. Many graphical models aim to achieve statistical and computational efficiency by omitting edges that do not correspond to strong interactions. For example, it is common to make the Markov assumption that the graphical model should only contain edges from {y^{(t−k)}, ..., y^{(t−1)}} to y^{(t)}, rather than containing edges from the entire past history. However, in some cases, we believe that all past inputs should have an influence on the next element of the sequence. RNNs are useful when we believe that the distribution over y^{(t)} may depend on a value of y^{(i)} from the distant past in a way that is not captured by the effect of y^{(i)} on y^{(t−1)}.
One way to in interpret
terpret an RNN as a graphical mo model
del is to view the RNN as
from the distant past in a way that is not captured by the effect of y on y .
defining a graphical mo model
del whose structure is the complete graph, able to represent
One
direct depway to interpret
dependencies
endencies betw an RNN
etween
een any pair as aofgraphical
y values. mo Thedel is to view
graphical mo the
del oRNN
model ver theas
defining a graphical mo del whose structure is the complete
y values with the complete graph structure is shown in Fig. 10.7. The complete graph, able to represent
direct dep
graph in endencies bof
interpretation
terpretation etw eenRNN
the any pair of y von
is based alues. The graphical
ignoring the hidden mounits
del ohver
(t) the
by
ymarginalizing
values with the complete graph
them out of the mo model. structure
del. is shown in Fig. 10.7 . The complete
graph interpretation of the RNN is based on ignoring the hidden units h by
It is more interesting to consider the graphical mo model
del structure of RNNs that
marginalizing them out of the model. ( t )
results from regarding the hidden units h as random variables. Including the
It is more interesting to consider the graphical model structure of RNNs that
The from
results conditional distribution
regarding over these
the hidden variables
units h as given their parents
random is deterministic.
variables. IncludingThis
theis
389
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS

hidden units in the graphical model reveals that the RNN provides a very efficient
parametrization of the joint distribution over the observations. Suppose that we
represented an arbitrary joint distribution over discrete values with a tabular
representation: an array containing a separate entry for each possible assignment
of values, with the value of that entry giving the probability of that assignment
occurring. If y can take on k different values, the tabular representation would
have O(k^τ) parameters. By comparison, due to parameter sharing, the number
of parameters in the RNN is O(1) as a function of sequence length. The number
of parameters in the RNN may be adjusted to control model capacity but is not
forced to scale with sequence length. Eq. 10.5 shows that the RNN parametrizes
long-term relationships between variables efficiently, using recurrent applications
of the same function f and same parameters θ at each time step. Fig. 10.8
illustrates the graphical model interpretation. Incorporating the h^{(t)} nodes in
the graphical model decouples the past and the future, acting as an intermediate
quantity between them. A variable y^{(i)} in the distant past may influence a variable
y^{(t)} via its effect on h. The structure of this graph shows that the model can be
efficiently parametrized by using the same conditional probability distributions at
each time step, and that when the variables are all observed, the probability of the
joint assignment of all variables can be evaluated efficiently.

Even with the efficient parametrization of the graphical model, some operations
remain computationally challenging. For example, it is difficult to predict missing
values in the middle of the sequence.
The price recurrent networks pay for their reduced number of parameters is
that optimizing the parameters may be difficult.

The parameter sharing used in recurrent networks relies on the assumption
that the same parameters can be used for different time steps. Equivalently, the
assumption is that the conditional probability distribution over the variables at
time t + 1 given the variables at time t is stationary, meaning that the relationship
between the previous time step and the next time step does not depend on t. In
principle, it would be possible to use t as an extra input at each time step and let
the learner discover any time-dependence while sharing as much as it can between
different time steps. This would already be much better than using a different
conditional probability distribution for each t, but the network would then have to
extrapolate when faced with new values of t.

To complete our view of an RNN as a graphical model, we must describe how
to draw samples from the model. The main operation that we need to perform is
simply to sample from the conditional distribution at each time step. However,
there is one additional complication. The RNN must have some mechanism for
determining the length of the sequence. This can be achieved in various ways.

In the case when the output is a symbol taken from a vocabulary, one can
add a special symbol corresponding to the end of a sequence (Schmidhuber, 2012).
When that symbol is generated, the sampling process stops. In the training set,
we insert this symbol as an extra member of the sequence, immediately after x^{(τ)},
in each training example.

Another option is to introduce an extra Bernoulli output to the model that
represents the decision to either continue generation or halt generation at each
time step. This approach is more general than the approach of adding an extra
symbol to the vocabulary, because it may be applied to any RNN, rather than
only RNNs that output a sequence of symbols. For example, it may be applied to
an RNN that emits a sequence of real numbers. The new output unit is usually a
sigmoid unit trained with the cross-entropy loss. In this approach the sigmoid is
trained to maximize the log-probability of the correct prediction as to whether the
sequence ends or continues at each time step.
Another way to determine the sequence length τ is to add an extra output to
the model that predicts the integer τ itself. The model can sample a value of τ
and then sample τ steps worth of data. This approach requires adding an extra
input to the recurrent update at each time step so that the recurrent update is
aware of whether it is near the end of the generated sequence. This extra input
can either consist of the value of τ or can consist of τ − t, the number of remaining
time steps. Without this extra input, the RNN might generate sequences that
end abruptly, such as a sentence that ends before it is complete. This approach is
based on the decomposition

    P(x^{(1)}, \ldots, x^{(\tau)}) = P(\tau) P(x^{(1)}, \ldots, x^{(\tau)} \mid \tau).    (10.27)

The strategy of predicting τ directly is used for example by Goodfellow et al.
(2014d).
10.2.4 Modeling Sequences Conditioned on Context with RNNs

In the previous section we described how an RNN could correspond to a directed
graphical model over a sequence of random variables y^{(t)} with no inputs x. Of
course, our development of RNNs as in Eq. 10.8 included a sequence of inputs
x^{(1)}, x^{(2)}, . . . , x^{(τ)}. In general, RNNs allow the extension of the graphical model
view to represent not only a joint distribution over the y variables but also a
conditional distribution over y given x. As discussed in the context of feedforward
networks in Sec. 6.2.1.1, any model representing a variable P(y; θ) can be reinter-
preted as a model representing a conditional distribution P(y | ω) with ω = θ. We
can extend such a model to represent a distribution P(y | x) by using the same
P(y | ω) as before, but making ω a function of x. In the case of an RNN, this
can be achieved in different ways. We review here the most common and obvious
choices.

Previously, we have discussed RNNs that take a sequence of vectors x^{(t)} for
t = 1, . . . , τ as input. Another option is to take only a single vector x as input.
When x is a fixed-size vector, we can simply make it an extra input of the RNN
that generates the sequence. Some common ways of providing an extra input to
an RNN are:

1. as an extra input at each time step, or
2. as the initial state h^{(0)}, or
3. both.

The first and most common approach is illustrated in Fig. 10.9. The interaction
between the input x and each hidden unit vector h^{(t)} is parametrized by a newly
introduced weight matrix R that was absent from the model of only the sequence
of y values. The same product x^\top R is added as additional input to the hidden
units at every time step. We can think of the choice of x as determining the value
of x^\top R that is effectively a new bias parameter used for each of the hidden units.
The weights remain independent of the input. We can think of this model as taking
the parameters θ of the non-conditional model and turning them into ω, where
the bias parameters within ω are now a function of the input.
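A minimal numpy sketch of this conditioning scheme (the shapes and the tanh nonlinearity are our own illustrative choices, not the book's code): the product x^⊤R acts as an extra, input-dependent bias in every hidden-state update.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_y = 8, 4, 3                        # sizes chosen arbitrarily

W = rng.normal(scale=0.1, size=(n_h, n_h))     # hidden-to-hidden
R = rng.normal(scale=0.1, size=(n_x, n_h))     # conditioning matrix R
V = rng.normal(scale=0.1, size=(n_h, n_y))     # hidden-to-output
b = np.zeros(n_h)

def generate(x, steps):
    """Hidden updates h = tanh(W^T h + x^T R + b): the fixed vector x
    contributes the same extra bias x^T R at every time step."""
    h = np.zeros(n_h)
    xR = x @ R                                 # computed once, reused each step
    outputs = []
    for _ in range(steps):
        h = np.tanh(h @ W + xR + b)
        outputs.append(h @ V)                  # e.g., pre-softmax scores
    return np.array(outputs)

print(generate(rng.normal(size=n_x), steps=5).shape)   # (5, 3)
```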
Rather than receiving only a single vector x as input, the RNN may receive a
sequence of vectors x^{(t)} as input. The RNN described in Eq. 10.8 corresponds to a
conditional distribution P(y^{(1)}, \ldots, y^{(\tau)} \mid x^{(1)}, \ldots, x^{(\tau)}) that makes a conditional
independence assumption that this distribution factorizes as

    \prod_t P(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}).    (10.28)

To remove the conditional independence assumption, we can add connections from
the output at time t to the hidden unit at time t + 1, as shown in Fig. 10.10. The
model can then represent arbitrary probability distributions over the y sequence.
This kind of model representing a distribution over a sequence given another
sequence still has one restriction, which is that the length of both sequences must
be the same. We describe how to remove this restriction in Sec. 10.4.

Figure 10.9: An RNN that maps a fixed-length vector x into a distribution over sequences
Y. This RNN is appropriate for tasks such as image captioning, where a single image is
used as input to a model that then produces a sequence of words describing the image.
Each element y^{(t)} of the observed output sequence serves both as input (for the current
time step) and, during training, as target (for the previous time step).

Figure 10.10: A conditional recurrent neural network mapping a variable-length sequence
of x values into a distribution over sequences of y values of the same length. Compared
to Fig. 10.3, this RNN contains connections from the previous output to the current state.
These connections allow this RNN to model an arbitrary distribution over sequences of y
given sequences of x of the same length. The RNN of Fig. 10.3 is only able to represent
distributions in which the y values are conditionally independent from each other given
the x values.

Figure 10.11: Computation of a typical bidirectional recurrent neural network, meant
to learn to map input sequences x to target sequences y, with loss L^{(t)} at each step t.
The h recurrence propagates information forward in time (towards the right) while the
g recurrence propagates information backward in time (towards the left). Thus at each
point t, the output units o^{(t)} can benefit from a relevant summary of the past in its h^{(t)}
input and from a relevant summary of the future in its g^{(t)} input.

10.3 Bidirectional RNNs

All of the recurrent networks we have considered up to now have a "causal" struc-
ture, meaning that the state at time t only captures information from the past,
x^{(1)}, . . . , x^{(t-1)}, and the present input x^{(t)}. Some of the models we have discussed
also allow information from past y values to affect the current state when the y
values are available.

However, in many applications we want to output a prediction of y^{(t)} which
may depend on the whole input sequence. For example, in speech recognition,
the correct interpretation of the current sound as a phoneme may depend on the
next few phonemes because of co-articulation and potentially may even depend on
the next few words because of the linguistic dependencies between nearby words: if
there are two interpretations of the current word that are both acoustically plausible,
we may have to look far into the future (and the past) to disambiguate them.
This is also true of handwriting recognition and many other sequence-to-sequence
learning tasks, described in the next section.

Bidirectional recurrent neural networks (or bidirectional RNNs) were invented
to address that need (Schuster and Paliwal, 1997). They have been extremely suc-
cessful (Graves, 2012) in applications where that need arises, such as handwriting
recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recogni-
tion (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi
et al., 1999).

As the name suggests, bidirectional RNNs combine an RNN that moves forward
through time beginning from the start of the sequence with another RNN that
moves backward through time beginning from the end of the sequence. Fig. 10.11
illustrates the typical bidirectional RNN, with h^{(t)} standing for the state of the
sub-RNN that moves forward through time and g^{(t)} standing for the state of the
sub-RNN that moves backward through time. This allows the output units o^{(t)} to
compute a representation that depends on both the past and the future but
is most sensitive to the input values around time t, without having to specify a
fixed-size window around t (as one would have to do with a feedforward network,
a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).

This idea can be naturally extended to 2-dimensional input, such as images,
by having four RNNs, each one going in one of the four directions: up, down,
left, right. At each point (i, j) of a 2-D grid, an output O_{i,j} could then compute a
representation that would capture mostly local information but could also depend
on long-range inputs, if the RNN is able to learn to carry that information.
Compared to a convolutional network, RNNs applied to images are typically more
expensive but allow for long-range lateral interactions between features in the
same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the
forward propagation equations for such RNNs may be written in a form that shows
they use a convolution that computes the bottom-up input to each layer, prior
to the recurrent propagation across the feature map that incorporates the lateral
interactions.
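Returning to the one-dimensional case of Fig. 10.11, the following numpy sketch (an illustration with arbitrary sizes, not code from any of the cited papers) shows the two recurrences: h runs forward in time, g runs backward, and each output sees both.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out = 4, 8, 3              # arbitrary sizes for the sketch

Wf = rng.normal(scale=0.1, size=(n_h, n_h))   # forward hidden-to-hidden
Uf = rng.normal(scale=0.1, size=(n_in, n_h))  # forward input-to-hidden
Wb = rng.normal(scale=0.1, size=(n_h, n_h))   # backward hidden-to-hidden
Ub = rng.normal(scale=0.1, size=(n_in, n_h))  # backward input-to-hidden
V = rng.normal(scale=0.1, size=(2 * n_h, n_out))

def bidirectional_rnn(xs):
    T = len(xs)
    h = np.zeros((T, n_h))
    g = np.zeros((T, n_h))
    for t in range(T):                  # forward recurrence (h)
        prev = h[t - 1] if t > 0 else np.zeros(n_h)
        h[t] = np.tanh(prev @ Wf + xs[t] @ Uf)
    for t in reversed(range(T)):        # backward recurrence (g)
        nxt = g[t + 1] if t < T - 1 else np.zeros(n_h)
        g[t] = np.tanh(nxt @ Wb + xs[t] @ Ub)
    # each o^(t) sees a summary of the past (h) and of the future (g)
    return np.concatenate([h, g], axis=1) @ V

print(bidirectional_rnn(rng.normal(size=(6, n_in))).shape)  # (6, 3)
```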
10.4 Encoder-Decoder Sequence-to-Sequence Architec-
tures

We have seen in Fig. 10.5 how an RNN can map an input sequence to a fixed-size
vector. We have seen in Fig. 10.9 how an RNN can map a fixed-size vector to a
sequence. We have seen in Fig. 10.3, Fig. 10.4, Fig. 10.10 and Fig. 10.11 how an
RNN can map an input sequence to an output sequence of the same length.

Here we discuss how an RNN can be trained to map an input sequence to an
output sequence which is not necessarily of the same length. This comes up in
many applications, such as speech recognition, machine translation or question
answering, where the input and output sequences in the training set are generally
not of the same length (although their lengths might be related).

We often call the input to the RNN the "context." We want to produce a
representation of this context, C. The context C might be a vector or sequence of
vectors that summarize the input sequence X = (x^{(1)}, . . . , x^{(n_x)}).

The simplest RNN architecture for mapping a variable-length sequence to an-
other variable-length sequence was first proposed by Cho et al. (2014a) and shortly
after by Sutskever et al. (2014), who independently developed that architecture and
were the first to obtain state-of-the-art translation using this approach. The former
system is based on scoring proposals generated by another machine translation
system, while the latter uses a standalone recurrent network to generate the trans-
lations. These authors respectively called this architecture, illustrated in Fig. 10.12,
the encoder-decoder or sequence-to-sequence architecture. The idea is very simple:
(1) an encoder or reader or input RNN processes the input sequence. The encoder
emits the context C, usually as a simple function of its final hidden state. (2) a
decoder or writer or output RNN is conditioned on that fixed-length vector (just like
in Fig. 10.9) to generate the output sequence Y = (y^{(1)}, . . . , y^{(n_y)}). The innovation
of this kind of architecture over those presented in earlier sections of this chapter is
that the lengths n_x and n_y can vary from each other, while previous architectures
constrained n_x = n_y = τ. In a sequence-to-sequence architecture, the two RNNs

Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture,
for learning to generate an output sequence (y^{(1)}, . . . , y^{(n_y)}) given an input sequence
(x^{(1)}, x^{(2)}, . . . , x^{(n_x)}). It is composed of an encoder RNN that reads the input sequence
and a decoder RNN that generates the output sequence (or computes the probability of a
given output sequence). The final hidden state of the encoder RNN is used to compute a
generally fixed-size context variable C which represents a semantic summary of the input
sequence and is given as input to the decoder RNN.

are trained jointly to maximize the average of log P(y^{(1)}, \ldots, y^{(n_y)} \mid x^{(1)}, \ldots, x^{(n_x)})
over all the pairs of x and y sequences in the training set. The last state h_{n_x} of
the encoder RNN is typically used as a representation C of the input sequence
that is provided as input to the decoder RNN.

If the context C is a vector, then the decoder RNN is simply a vector-to-
sequence RNN as described in Sec. 10.2.4. As we have seen, there are at least two
ways for a vector-to-sequence RNN to receive input. The input can be provided as
the initial state of the RNN, or the input can be connected to the hidden units at
each time step. These two ways can also be combined.

There is no constraint that the encoder must have the same size of hidden layer
as the decoder.

One clear limitation of this architecture is when the context C output by the
encoder RNN has a dimension that is too small to properly summarize a long
sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context
of machine translation. They proposed to make C a variable-length sequence rather
than a fixed-size vector. Additionally, they introduced an attention mechanism
that learns to associate elements of the sequence C to elements of the output
sequence. See Sec. 12.4.5.1 for more details.
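A compact numpy sketch of the basic (pre-attention) encoder-decoder idea follows; the shapes, nonlinearities, and conditioning scheme are our own illustrative choices. The encoder's final state becomes C, which conditions every decoder step, and the output length n_y is free to differ from n_x.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out = 4, 8, 5                     # arbitrary sizes

We = rng.normal(scale=0.1, size=(n_h, n_h))    # encoder hidden-to-hidden
Ue = rng.normal(scale=0.1, size=(n_in, n_h))   # encoder input-to-hidden
Wd = rng.normal(scale=0.1, size=(n_h, n_h))    # decoder hidden-to-hidden
Uc = rng.normal(scale=0.1, size=(n_h, n_h))    # context-to-hidden
Vd = rng.normal(scale=0.1, size=(n_h, n_out))  # decoder hidden-to-output

def encode(xs):
    """Reader RNN: the context C is its final hidden state."""
    h = np.zeros(n_h)
    for x in xs:
        h = np.tanh(h @ We + x @ Ue)
    return h                                   # C

def decode(C, n_y):
    """Writer RNN conditioned on C at every step; n_y need not
    equal the input length n_x."""
    s, ys = np.zeros(n_h), []
    for _ in range(n_y):
        s = np.tanh(s @ Wd + C @ Uc)
        ys.append(s @ Vd)                      # e.g., pre-softmax scores
    return np.array(ys)

C = encode(rng.normal(size=(7, n_in)))         # n_x = 7
print(decode(C, n_y=3).shape)                  # (3, 5): n_y differs from n_x
```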
10.5 Deep Recurrent Networks

The computation in most RNNs can be decomposed into three blocks of parameters
and associated transformations:

1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

With the RNN architecture of Fig. 10.3, each of these three blocks is associated
with a single weight matrix. In other words, when the network is unfolded, each
of these corresponds to a shallow transformation. By a shallow transformation,
we mean a transformation that would be represented by a single layer within
a deep MLP. Typically this is a transformation represented by a learned affine
transformation followed by a fixed nonlinearity.

Would it be advantageous to introduce depth in each of these operations?
Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests
so. The experimental evidence is in agreement with the idea that we need enough

Figure 10.13: A recurrent neural network can be made deep in many ways (Pascanu
et al., 2014a). The hidden recurrent state can be broken down into groups organized
hierarchically. Deeper computation (e.g., an MLP) can be introduced in the input-to-
hidden, hidden-to-hidden and hidden-to-output parts. This may lengthen the shortest
path linking different time steps. The path-lengthening effect can be mitigated by
introducing skip connections.

depth in order to perform the required mappings. See also Schmidhuber (1992),
El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs.

Graves et al. (2013) were the first to show a significant benefit of decomposing
the state of an RNN into multiple layers as in Fig. 10.13 (left). We can think
of the lower layers in the hierarchy depicted in Fig. 10.13a as playing a role in
transforming the raw input into a representation that is more appropriate, at
the higher levels of the hidden state. Pascanu et al. (2014a) go a step further
and propose to have a separate MLP (possibly deep) for each of the three blocks
enumerated above, as illustrated in Fig. 10.13b. Considerations of representational
capacity suggest to allocate enough capacity in each of these three steps, but doing
so by adding depth may hurt learning by making optimization difficult. In general,
it is easier to optimize shallower architectures, and adding the extra depth of
Fig. 10.13b makes the shortest path from a variable in time step t to a variable
in time step t + 1 become longer. For example, if an MLP with a single hidden
layer is used for the state-to-state transition, we have doubled the length of the
shortest path between variables in any two different time steps, compared with the
ordinary RNN of Fig. 10.3. However, as argued by Pascanu et al. (2014a), this
can be mitigated by introducing skip connections in the hidden-to-hidden path, as
illustrated in Fig. 10.13c.
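To make the skip-connection remedy concrete, here is a small numpy sketch (ours, not code from Pascanu et al.) of a state-to-state transition deepened by a one-hidden-layer MLP, with a skip path that keeps the shortest route between consecutive states at length one:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_m = 8, 16                    # state size and MLP hidden size (arbitrary)

W1 = rng.normal(scale=0.1, size=(n_h, n_m))   # MLP layer 1 of the transition
W2 = rng.normal(scale=0.1, size=(n_m, n_h))   # MLP layer 2 of the transition
Ws = rng.normal(scale=0.1, size=(n_h, n_h))   # skip (direct) connection

def deep_transition(h):
    """Deep hidden-to-hidden update. The MLP path adds depth; the skip
    path keeps a direct route between h^(t) and h^(t+1), mitigating the
    path-lengthening effect discussed above."""
    mlp_path = np.tanh(np.tanh(h @ W1) @ W2)
    skip_path = h @ Ws
    return np.tanh(mlp_path + skip_path)

h = np.zeros(n_h)
for _ in range(5):                  # unroll a few steps (no input, for brevity)
    h = deep_transition(h)
print(h.shape)                      # (8,)
```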
10.6 Recursive Neural Networks

Recursive neural networks represent yet another generalization of recurrent net-
works, with a different kind of computational graph, which is structured as a deep
tree, rather than the chain-like structure of RNNs. (We suggest to not abbreviate
"recursive neural network" as "RNN" to avoid confusion with "recurrent neural
network.") The typical computational graph for a recursive network is illustrated
in Fig. 10.14. Recursive neural networks were introduced by Pollack (1990) and
their potential use for learning to reason was described by Bottou (2011). Recursive
networks have been successfully applied to processing data structures as input to
neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher
et al., 2011a,c, 2013a) as well as in computer vision (Socher et al., 2011b).

One clear advantage of recursive nets over recurrent nets is that for a sequence
of the same length τ, the depth (measured as the number of compositions of
nonlinear operations) can be drastically reduced from τ to O(log τ), which might
help deal with long-term dependencies. An open question is how to best structure
the tree. One option is to have a tree structure which does not depend on the data,

Figure 10.14: A recursive network has a computational graph that generalizes that of the
recurrent network from a chain to a tree. A variable-size sequence x^{(1)}, x^{(2)}, . . . , x^{(t)} can
be mapped to a fixed-size representation (the output o), with a fixed set of parameters
(the weight matrices U, V, W). The figure illustrates a supervised learning case in which
some target y is provided which is associated with the whole sequence.

such as a balanced binary tree. In some application domains, external methods
can suggest the appropriate tree structure. For example, when processing natural
language sentences, the tree structure for the recursive network can be fixed to
the structure of the parse tree of the sentence provided by a natural language
parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to
discover and infer the tree structure that is appropriate for any given input, as
suggested by Bottou (2011).

Many variants of the recursive net idea are possible. For example, Frasconi
et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure,
and associate the inputs and targets with individual nodes of the tree. The
computation performed by each node does not have to be the traditional artificial
neuron computation (affine transformation of all inputs followed by a monotone
nonlinearity). For example, Socher et al. (2013a) propose using tensor operations
and bilinear forms, which have previously been found useful to model relationships
between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are
represented by continuous vectors (embeddings).
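As an illustration of the O(log τ) depth (our sketch, with one shared composition function and arbitrary sizes), a recursive net over a balanced binary tree repeatedly merges pairs of child representations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                        # representation size (arbitrary)
U = rng.normal(scale=0.1, size=(n, n))       # left-child weights
W = rng.normal(scale=0.1, size=(n, n))       # right-child weights

def merge(left, right):
    """Shared composition applied at every internal tree node."""
    return np.tanh(left @ U + right @ W)

def recursive_net(xs):
    """Reduce a sequence over a balanced binary tree: the depth is
    O(log tau) instead of the tau of a chain-structured RNN."""
    level = list(xs)
    while len(level) > 1:
        if len(level) % 2:                   # odd count: carry one node up
            level, odd = level[:-1], [level[-1]]
        else:
            odd = []
        level = [merge(level[i], level[i + 1])
                 for i in range(0, len(level), 2)] + odd
    return level[0]                          # fixed-size representation o

xs = rng.normal(size=(16, n))                # tau = 16 leaves
print(recursive_net(xs).shape)               # (8,), after log2(16) = 4 levels
```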
10.7 The Challenge of Long-Term Dependencies

The mathematical challenge of learning long-term dependencies in recurrent net-
works was introduced in Sec. 8.2.5. The basic problem is that gradients propagated
over many stages tend to either vanish (most of the time) or explode (rarely, but
with much damage to the optimization). Even if we assume that the parameters are
such that the recurrent network is stable (can store memories, with gradients not
exploding), the difficulty with long-term dependencies arises from the exponentially
smaller weights given to long-term interactions (involving the multiplication of
many Jacobians) compared to short-term ones. Many other sources provide a
deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al.,
2013a). In this section, we describe the problem in more detail. The remaining
sections describe approaches to overcoming the problem.

Recurrent networks involve the composition of the same function multiple
times, once per time step. These compositions can result in extremely nonlinear
behavior, as illustrated in Fig. 10.15.

In particular, the function composition employed by recurrent neural networks
somewhat resembles matrix multiplication. We can think of the recurrence relation

    h^{(t)} = W^\top h^{(t-1)}    (10.29)

as a very simple recurrent neural network lacking a nonlinear activation function,
and lacking inputs x.
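Before analyzing this recurrence formally, a small numpy experiment (ours, not the book's) shows what repeated multiplication by a fixed W does to a state vector: components along eigenvectors with magnitude less than one vanish, while those with magnitude greater than one explode.

```python
import numpy as np

# Build a symmetric W with known eigenvalues, so W = Q diag(lam) Q^T
# and W^T = W, making h^(t) = (W^t)^T h^(0) easy to track.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(3, 3)))
lam = np.array([0.5, 0.9, 1.1])              # eigenvalue magnitudes
W = Q @ np.diag(lam) @ Q.T

h = Q @ np.ones(3)                           # equal weight on each eigenvector
for t in (1, 10, 50):
    ht = np.linalg.matrix_power(W.T, t) @ h  # h^(t) = (W^t)^T h^(0)
    print(t, np.abs(Q.T @ ht))               # components in the eigenbasis

# The 0.5- and 0.9-components shrink toward zero while the 1.1-component
# grows: only the largest eigendirection survives repeated application.
```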

Figure 10.15: When composing many nonlinear functions (like the linear-tanh layer shown
here), the result is highly nonlinear, typically with most of the values associated with a tiny
derivative, some values with a large derivative, and many alternations between increasing
and decreasing. In this plot, we plot a linear projection of a 100-dimensional hidden state
down to a single dimension, plotted on the y-axis. The x-axis is the coordinate of the
initial state along a random direction in the 100-dimensional space. We can thus view this
plot as a linear cross-section of a high-dimensional function. The plots show the function
after each time step, or equivalently, after each number of times the transition function
has been composed.

and lacking inputs x. As described in Sec. 8.2.5, this recurrence relation essentially describes the power method. It may be simplified to

    h^{(t)} = \left( W^{t} \right)^\top h^{(0)},                      (10.30)

and if W admits an eigendecomposition

    W = Q \Lambda Q^\top,                                             (10.31)

the recurrence may be simplified further to

    h^{(t)} = Q^\top \Lambda^{t} Q \, h^{(0)}.                        (10.32)
The eigenvalues are raised to the power of t, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of h^{(0)} that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight w by itself many times. The product w^{t} will either vanish or explode depending on the magnitude of w. However, if we make a non-recurrent network that has a different weight w^{(t)} at each time step, the situation is different. If the initial state is given by 1, then the state at time t is given by \prod_t w^{(t)}. Suppose that the w^{(t)} values are generated randomly, independently from one another, with zero mean and variance v. The variance of the product is O(v^{n}). To obtain some desired variance v^{*} we may choose the individual weights with variance v = \sqrt[n]{v^{*}}. Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).

The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994). One may hope that the problem can be avoided simply by staying in a region of parameter space where the gradients do not vanish or explode. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish (Bengio et al., 1993, 1994). Specifically, whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction. It does not mean that it is impossible to learn, but that it might take a very long time to learn long-term dependencies, because the signal about these dependencies will tend to be hidden by the smallest fluctuations arising from short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that as we increase the span of the dependencies that
need to be captured, gradient-based optimization becomes increasingly difficult,
with the probability of successful training of a traditional RNN via SGD rapidly reaching 0 for sequences of only length 10 or 20.

For a deeper treatment of the dynamical systems view of recurrent networks, see Doya (1993), Bengio et al. (1994) and Siegelmann and Sontag (1995), with a review in Pascanu et al. (2013a). The remaining sections of this chapter discuss various approaches that have been proposed to reduce the difficulty of learning long-term dependencies (in some cases allowing an RNN to learn dependencies across hundreds of steps), but the problem of learning long-term dependencies remains one of the main challenges in deep learning.
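The behavior described by Eqs. 10.29–10.32 is easy to verify numerically. The following minimal NumPy sketch (an illustrative addition; the eigenvalues 1.1, 0.9 and 0.5 are arbitrary choices) iterates the linear recurrence and prints the state's coordinates in the eigenbasis of W, showing how components with |λ| < 1 decay, components with |λ| > 1 explode, and the state aligns with the dominant eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build W = Q diag(lam) Q^T with orthogonal Q, as in Eq. 10.31.
lam = np.array([1.1, 0.9, 0.5])
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal basis
W = Q @ np.diag(lam) @ Q.T

h = rng.standard_normal(3)                         # initial state h^(0)
for t in range(1, 51):
    h = W.T @ h                                    # Eq. 10.29: h^(t) = W^T h^(t-1)
    if t % 10 == 0:
        print(t, np.round(Q.T @ h, 4))             # coordinates in the eigenbasis
# The coordinate along lam = 1.1 grows like 1.1^t while the others shrink
# like 0.9^t and 0.5^t: only the dominant eigendirection survives.
```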
10.8 Echo State Networks

The recurrent weights mapping from h^{(t-1)} to h^{(t)} and the input weights mapping from x^{(t)} to h^{(t)} are some of the most difficult parameters to learn in a recurrent network. One proposed (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b) approach to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights. This is the idea that was independently proposed for echo state networks or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b) and liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009) to denote the fact that the hidden units form a reservoir of temporal features which may capture different aspects of the history of inputs.
One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary length sequence (the history of inputs up to time t) into a fixed-length vector (the recurrent state h^{(t)}), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest. The training criterion may then be easily designed to be convex as a function of the output weights. For example, if the output consists of linear regression from the hidden units to the output targets, and the training criterion is mean squared error, then it is convex and may be solved reliably with simple learning algorithms (Jaeger, 2003).
important
ortant question is therefore: ho howw do we set the input and recurrent
simple learning algorithms (Jaeger, 2003).
weights so that a rich set of histories can be represented in the recurren recurrentt neural
net The
network imp ortant question
work state? The answer prop is therefore:
proposed ho w do we set the input
osed in the reservoir computing literature and recurrentis to
weights so that a rich set of histories can be represented in the recurrent neural
network state? The answer proposed in 406the reservoir computing literature is to
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS

view the recurren


recurrentt net as a dynamical system, and set the input and recurrent
weights suc such h that the dynamical system is near the edge of stabilit stability y.
view the recurrent net as a dynamical system, and set the input and recurrent
The original idea was to make the eigen eigenv values of the Jacobian of the state-to-
weights such that the dynamical system is near the edge of stability.
state transition function be close to 1. As explained in Sec. 8.2.5, an imp importan
ortan
ortantt
The original idea
characteristic of a recurren w as to
recurrentt netw make
network the eigen
ork is the eigenvv alues
eigenvalue of the
alue sp Jacobian
spectrum of the
ectrum of the Jacobians state-to-
state
( t) transition
∂s function be close to 1. As explained in Sec. 8.2.5 , an important
J = ∂s . Of particular imp importance
ortance is the sp speectr al radius of J(t) , defined to be
ctral
characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians
the maximum of the absolute values of its eigenv eigenvalues.
alues.
J = . Of particular importance is the spectral radius of J , defined to be
To understand the effect of the spectral radius, consider the simple case of back-propagation with a Jacobian matrix J that does not change with t. This case happens, for example, when the network is purely linear. Suppose that J has an eigenvector v with corresponding eigenvalue λ. Consider what happens as we propagate a gradient vector backwards through time. If we begin with a gradient vector g, then after one step of back-propagation, we will have Jg, and after n steps we will have J^{n}g. Now consider what happens if we instead back-propagate a perturbed version of g. If we begin with g + δv, then after one step, we will have J(g + δv). After n steps, we will have J^{n}(g + δv). From this we can see that back-propagation starting from g and back-propagation starting from g + δv diverge by δJ^{n}v after n steps of back-propagation. If v is chosen to be a unit eigenvector of J with eigenvalue λ, then multiplication by the Jacobian simply scales the difference at each step. The two executions of back-propagation are separated by a distance of δ|λ|^{n}. When v corresponds to the largest value of |λ|, this perturbation achieves the widest possible separation of an initial perturbation of size δ.

When |λ| > 1, the deviation size δ|λ|^{n} grows exponentially large. When |λ| < 1, the deviation size becomes exponentially small.

Of course, this example assumed that the Jacobian was the same at every time step, corresponding to a recurrent network with no nonlinearity. When a nonlinearity is present, the derivative of the nonlinearity will approach zero on many time steps, and help to prevent the explosion resulting from a large spectral radius. Indeed, the most recent work on echo state networks advocates using a spectral radius much larger than unity (Yildiz et al., 2012; Jaeger, 2012).

Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no nonlinearity, where the state h^{(t+1)\top} = h^{(t)\top} W.

When a linear map W^\top always shrinks h as measured by the L^2 norm, then we say that the map is contractive. When the spectral radius is less than one, the mapping from h^{(t)} to h^{(t+1)} is contractive, so a small change becomes smaller after each time step. This necessarily makes the network forget information about the past when we use a finite level of precision (such as 32 bit integers) to store the state vector.

The Jacobian matrix tells us how a small change of h^{(t)} propagates one step forward, or equivalently, how the gradient on h^{(t+1)} propagates one step backward, during back-propagation. Note that neither W nor J need to be symmetric (although they are square and real), so they can have complex-valued eigenvalues and eigenvectors, with imaginary components corresponding to potentially oscillatory behavior (if the same Jacobian were applied iteratively). Even though h^{(t)} or a small variation of h^{(t)} of interest in back-propagation are real-valued, they can be expressed in such a complex-valued basis. What matters is what happens to the magnitude (complex absolute value) of these possibly complex-valued basis coefficients when we multiply the matrix by the vector. An eigenvalue with magnitude greater than one corresponds to magnification (exponential growth, if applied iteratively); an eigenvalue with magnitude less than one corresponds to shrinking (exponential decay, if applied iteratively).

With a nonlinear map, the Jacobian is free to change at each step. The dynamics therefore become more complicated. However, it remains true that a small initial variation can turn into a large variation after several steps. One difference between the purely linear case and the nonlinear case is that the use of a squashing nonlinearity such as tanh can cause the recurrent dynamics to become bounded. Note that it is possible for back-propagation to retain unbounded dynamics even when forward propagation has bounded dynamics, for example, when a sequence of tanh units are all in the middle of their linear regime and are connected by weight matrices with spectral radius greater than 1. However, it is rare for all of the tanh units to simultaneously lie at their linear activation point.

The strategy of echo state networks is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode due to the stabilizing effect of saturating nonlinearities like tanh.

More recently, it has been shown that the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (with the hidden-to-hidden recurrent weights trained using back-propagation through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever et al., 2013). In this setting, an initial spectral radius of 1.2 performs well, combined with the sparse initialization scheme described in Sec. 8.4.
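The stabilizing effect of the saturating nonlinearity invoked above is easy to check numerically. In this minimal sketch (an illustrative addition; the sizes and the radius 3 are arbitrary), a purely linear recurrence with spectral radius 3 explodes, while the same weights passed through tanh remain bounded:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 20))
W *= 3.0 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius 3

h_lin = h_tanh = rng.standard_normal(20) * 0.1
for t in range(30):
    h_lin = W @ h_lin                  # linear: norm grows roughly like 3^t
    h_tanh = np.tanh(W @ h_tanh)       # squashed: each unit stays in (-1, 1)

print("linear ||h||:", np.linalg.norm(h_lin))     # astronomically large
print("tanh   ||h||:", np.linalg.norm(h_tanh))    # bounded by sqrt(20)
```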
10.9 Leaky Units and Other Strategies for Multiple Time Scales

One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently. Various strategies for building both fine and coarse time scales are possible. These include the addition of skip connections across time, “leaky units” that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.
10.9.1 Adding Skip Connections through Time
One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present. The idea of using such skip connections dates back to Lin et al. (1996) and follows from the idea of incorporating delays in feedforward neural networks (Lang and Hinton, 1988). In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t + 1. It is possible to construct recurrent networks with longer delays (Bengio, 1991).

As we have seen in Sec. 8.2.5, gradients may vanish or explode exponentially with respect to the number of time steps. Lin et al. (1996) introduced recurrent connections with a time-delay of d to mitigate this problem. Gradients now diminish exponentially as a function of τ/d rather than τ. Since there are both delayed and single step connections, gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer dependencies, although not all long-term dependencies may be represented well in this way.
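As a rough numerical illustration of this effect (an added example; λ = 0.9, τ = 60 and d = 6 are arbitrary, with λ standing for a per-step attenuation factor of a gradient component), compare the factor λ^τ accumulated across length-one connections with the λ^{τ/d} accumulated along a path of delay-d connections:

```python
lam, tau, d = 0.9, 60, 6

plain = lam ** tau          # attenuation over tau ordinary, length-one steps
skip = lam ** (tau / d)     # attenuation when gradients travel via delay-d hops

print(f"lambda^tau     = {plain:.2e}")   # ~1.8e-03: the signal is nearly gone
print(f"lambda^(tau/d) = {skip:.2e}")    # ~3.5e-01: much better preserved
```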
10.9.2 Leaky Units and a Spectrum of Different Time Scales
Another way to obtain paths on which the product of derivatives is close to one is to have units with self-connections and a weight near one on these connections.

When we accumulate a running average µ^{(t)} of some value v^{(t)} by applying the update µ^{(t)} ← αµ^{(t−1)} + (1 − α)v^{(t)}, the α parameter is an example of a linear self-connection from µ^{(t−1)} to µ^{(t)}. When α is near one, the running average remembers information about the past for a long time, and when α is near zero, information about the past is rapidly discarded. Hidden units with linear self-connections can behave similarly to such running averages. Such hidden units are called leaky units.
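A minimal sketch of this update (an illustrative addition; the values of α are chosen arbitrarily) shows how the time constant controls how long a leaky unit remembers an input impulse:

```python
def leaky_trace(alpha, vs):
    """Run the leaky update mu <- alpha * mu + (1 - alpha) * v over a signal."""
    mu, trace = 0.0, []
    for v in vs:
        mu = alpha * mu + (1.0 - alpha) * v
        trace.append(mu)
    return trace

signal = [1.0] + [0.0] * 9                 # a single impulse, then silence
for alpha in (0.5, 0.9, 0.99):
    trace = leaky_trace(alpha, signal)
    retained = trace[-1] / trace[0]        # fraction of the impulse response left
    print(f"alpha={alpha}: fraction retained after 9 more steps = {retained:.3f}")
# The fraction retained is alpha^9: about 0.002 for alpha=0.5 but 0.914 for
# alpha=0.99 -- a leaky unit with alpha near one remembers far longer.
```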
Skip connections through d time steps are a way of ensuring that a unit can always learn to be influenced by a value from d time steps earlier. The use of a linear self-connection with a weight near one is a different way of ensuring that the unit can access values from the past. The linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than by adjusting the integer-valued skip length.

These ideas were proposed by Mozer (1992) and by El Hihi and Bengio (1996). Leaky units were also found to be useful in the context of echo state networks (Jaeger et al., 2007).

There are two basic strategies for setting the time constants used by leaky units. One strategy is to manually fix them to values that remain constant, for example by sampling their values from some distribution once at initialization time. Another strategy is to make the time constants free parameters and learn them. Having such leaky units at different time scales appears to help with long-term dependencies (Mozer, 1992; Pascanu et al., 2013a).
10.9.3 Removing Connections
Another approach to handle long-term dependencies is the idea of organizing the state of the RNN at multiple time scales (El Hihi and Bengio, 1996), with information flowing more easily through long distances at the slower time scales.

This idea differs from the skip connections through time discussed earlier because it involves actively removing length-one connections and replacing them with longer connections. Units modified in such a way are forced to operate on a long time scale. Skip connections through time add edges. Units receiving such new connections may learn to operate on a long time scale but may also choose to focus on their other short-term connections.

There are different ways in which a group of recurrent units can be forced to operate at different time scales. One option is to make the recurrent units leaky, but to have different groups of units associated with different fixed time scales. This was the proposal in Mozer (1992) and has been successfully used in Pascanu et al. (2013a). Another option is to have explicit and discrete updates taking place at different times, with a different frequency for different groups of units. This is the approach of El Hihi and Bengio (1996) and Koutnik et al. (2014). It worked well on a number of benchmark datasets.

10.10 The Long Short-Term Memory and Other Gated RNNs
As of this writing, the most effective sequence models used in practical applications are called gated RNNs. These include the long short-term memory and networks based on the gated recurrent unit.

Like leaky units, gated RNNs are based on the idea of creating paths through time that have derivatives that neither vanish nor explode. Leaky units did this with connection weights that were either manually chosen constants or were parameters. Gated RNNs generalize this to connection weights that may change at each time step.

Leaky units allow the network to accumulate information (such as evidence for a particular feature or category) over a long duration. However, once that information has been used, it might be useful for the neural network to forget the old state. For example, if a sequence is made of sub-sequences and we want a leaky unit to accumulate evidence inside each sub-sequence, we need a mechanism to forget the old state by setting it to zero. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it. This is what gated RNNs do.
10.10.1 LSTM
The clever idea of introducing self-loops to produce paths where the gradient can flow for long durations is a core contribution of the initial long short-term memory (LSTM) model (Hochreiter and Schmidhuber, 1997). A crucial addition has been to make the weight on this self-loop conditioned on the context, rather than fixed (Gers et al., 2000). By making the weight of this self-loop gated (controlled by another hidden unit), the time scale of integration can be changed dynamically. In this case, we mean that even for an LSTM with fixed parameters, the time scale of integration can change based on the input sequence, because the time constants are output by the model itself. The LSTM has been found extremely successful in many applications, such as unconstrained handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015) and parsing (Vinyals et al., 2014a).

The LSTM block diagram is illustrated in Fig. 10.16. The corresponding forward propagation equations are given below, in the case of a shallow recurrent
Figure 10.16: Block diagram of the LSTM recurrent network “cell.” Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit. Its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid nonlinearity, while the input unit can have any squashing nonlinearity. The state unit can also be used as an extra input to the gating units. The black square indicates a delay of 1 time unit.
network architecture. Deeper architectures have also been successfully used (Graves et al., 2013; Pascanu et al., 2014a). Instead of a unit that simply applies an element-wise nonlinearity to the affine transformation of inputs and recurrent units, LSTM recurrent networks have “LSTM cells” that have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN. Each cell has the same inputs and outputs as an ordinary recurrent network, but has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit s_i^{(t)} that has a linear self-loop similar to the leaky units described in the previous section. However, here the self-loop weight (or the associated time constant) is controlled by a forget gate unit f_i^{(t)} (for time step t and cell i), that sets this weight to a value between 0 and 1 via a sigmoid unit:
    f_i^{(t)} = \sigma\Big( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} \Big),      (10.33)
where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, but with a conditional self-loop weight f_i^{(t)}:
    s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \Big),      (10.34)
where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i^{(t)} is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:
    g_i^{(t)} = \sigma\Big( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} \Big).      (10.35)
The output h_i^{(t)} of the LSTM cell can also be shut off, via the output gate q_i^{(t)}, which also uses a sigmoid unit for gating:

    h_i^{(t)} = \tanh\big( s_i^{(t)} \big) \, q_i^{(t)}      (10.36)

    q_i^{(t)} = \sigma\Big( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \Big)      (10.37)
which has parameters b^o, U^o, W^o for its biases, input weights and recurrent weights, respectively. Among the variants, one can choose to use the cell state s_i^{(t)} as an extra input (with its weight) into the three gates of the i-th unit, as shown in Fig. 10.16. This would require three additional parameters.
LSTM networks have been shown to learn long-term dependencies more easily than the simple recurrent architectures, first on artificial data sets designed for testing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Hochreiter et al., 2000), then on challenging sequence processing tasks where state-of-the-art performance was obtained (Graves, 2012; Graves et al., 2013; Sutskever et al., 2014). Variants and alternatives to the LSTM have been studied and used and are discussed next.
10.10.2 Other Gated RNNs
Which pieces of the LSTM architecture are actually necessary? What other successful architectures could be designed that allow the network to dynamically control the time scale and forgetting behavior of different units?

Some answers to these questions are given with the recent work on gated RNNs, whose units are also known as gated recurrent units or GRUs (Cho et al., 2014b; Chung et al., 2014, 2015a; Jozefowicz et al., 2015; Chrupala et al., 2015). The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equations are the following:
    h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + (1 - u_i^{(t-1)}) \, \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)} \Big),      (10.38)

where u stands for “update” gate and r for “reset” gate. Their value is defined as usual:

    u_i^{(t)} = \sigma\Big( b_i^u + \sum_j U_{i,j}^u x_j^{(t)} + \sum_j W_{i,j}^u h_j^{(t)} \Big)      (10.39)

and

    r_i^{(t)} = \sigma\Big( b_i^r + \sum_j U_{i,j}^r x_j^{(t)} + \sum_j W_{i,j}^r h_j^{(t)} \Big).      (10.40)
The reset and update gates can individually “ignore” parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it by the new “target state” value (towards which the leaky integrator wants to converge). The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.
Many more variants around this theme can be designed. For example, the reset gate (or forget gate) output could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control and local control. However, several investigations over architectural variations of the LSTM and GRU found no variant that would clearly beat both of these across a wide range of tasks (Greff et al., 2015; Jozefowicz et al., 2015). Greff et al. (2015) found that a crucial ingredient is the forget gate, while Jozefowicz et al. (2015) found that adding a bias of 1 to the LSTM forget gate, a practice advocated by Gers et al. (2000), makes the LSTM as strong as the best of the explored architectural variants.
10.11 Optimization for Long-Term Dependencies

Sec. 8.2.5 and Sec. 10.7 have described the vanishing and exploding gradient problems that occur when optimizing RNNs over many time steps.

An interesting idea proposed by Martens and Sutskever (2011) is that second derivatives may vanish at the same time that first derivatives vanish. Second-order optimization algorithms may roughly be understood as dividing the first derivative by the second derivative (in higher dimension, multiplying the gradient by the inverse Hessian). If the second derivative shrinks at a similar rate to the first derivative, then the ratio of first and second derivatives may remain relatively constant. Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large minibatch, and a tendency to be attracted to saddle points. Martens and Sutskever (2011) found promising results using second-order methods. Later, Sutskever et al. (2013) found that simpler methods such as Nesterov momentum with careful initialization could achieve similar results. See Sutskever (2012) for more detail. Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs. This is part of a continuing theme in machine learning that it is often much easier to design a model that is easy to optimize than it is to design a more powerful optimization algorithm.
10.11.1 Clipping Gradients
As discussed in Sec. 8.2.4, strongly nonlinear functions such as those computed by a recurrent net over many time steps tend to have derivatives that can be either very large or very small in magnitude. This is illustrated in Fig. 8.3 and Fig. 10.17, in which we see that the objective function (as a function of the parameters) has a “landscape” in which one finds “cliffs”: wide and rather flat regions separated by tiny regions where the objective function changes quickly, forming a kind of cliff.

The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far, into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution. The gradient tells us the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside of this infinitesimal region, the cost function may begin to curve back upwards. The update must be chosen to be small enough to avoid traversing too much upward curvature. We typically use learning rates that decay slowly enough that consecutive steps have approximately the same learning rate. A step size that is appropriate for a relatively linear part of the landscape is often inappropriate and causes uphill motion if we enter a more curved part of the landscape on the next step.

416
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS

Figure 10.17: Example of the effect of gradient clipping in a recurrent network with
two parameters w and b. Gradient clipping can make gradient descent perform more
reasonably in the vicinity of extremely steep cliffs. These steep cliffs commonly occur
in recurrent networks near where a recurrent network behaves approximately linearly.
The cliff is exponentially steep in the number of time steps because the weight matrix
is multiplied by itself once for each time step. Gradient descent without gradient
clipping overshoots the bottom of this small ravine, then receives a very large gradient
from the cliff face. The large gradient catastrophically propels the parameters outside the
axes of the plot. Gradient descent with gradient clipping has a more moderate
reaction to the cliff. While it does ascend the cliff face, the step size is restricted so that
it cannot be propelled away from the steep region near the solution. Figure adapted with
permission from Pascanu et al. (2013a).
A simple type of solution has been in use by practitioners for many years:
clipping the gradient. There are different instances of this idea (Mikolov, 2012;
Pascanu et al., 2013a). One option is to clip the parameter gradient from a
minibatch element-wise (Mikolov, 2012) just before the parameter update. Another
is to clip the norm ||g|| of the gradient g (Pascanu et al., 2013a) just before the
parameter update:

    if ||g|| > v                 (10.41)
    g ← g v / ||g||              (10.42)

where v is the norm threshold and g is used to update parameters. Because the
gradient of all the parameters (including different groups of parameters, such as
weights and biases) is renormalized jointly with a single scaling factor, the latter
method has the advantage that it guarantees that each step is still in the gradient
direction, but experiments suggest that both forms work similarly. Although
the parameter update has the same direction as the true gradient, with gradient
norm clipping, the parameter update vector norm is now bounded. This bounded
gradient avoids performing a detrimental step when the gradient explodes. In
fact, even simply taking a random step when the gradient magnitude is above
a threshold tends to work almost as well. If the explosion is so severe that the
gradient is numerically ∞ or NaN (considered infinite or not-a-number), then
a random step of size v can be taken and will typically move away from the
numerically unstable configuration. Clipping the gradient norm per-minibatch will
not change the direction of the gradient for an individual minibatch. However,
taking the average of the norm-clipped gradient from many minibatches is not
equivalent to clipping the norm of the true gradient (the gradient formed from
using all examples). Examples that have large gradient norm, as well as examples
that appear in the same minibatch as such examples, will have their contribution
to the final direction diminished. This stands in contrast to traditional minibatch
gradient descent, where the true gradient direction is equal to the average over all
minibatch gradients. Put another way, traditional stochastic gradient descent uses
an unbiased estimate of the gradient, while gradient descent with norm clipping
introduces a heuristic bias that we know empirically to be useful. With element-
wise clipping, the direction of the update is not aligned with the true gradient
or the minibatch gradient, but it is still a descent direction. It has also been
proposed (Graves, 2013) to clip the back-propagated gradient (with respect to
hidden units) but no comparison has been published between these variants; we
conjecture that all these methods behave similarly.
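As a concrete sketch of the two clipping variants, the following NumPy snippet
implements norm clipping (Eq. 10.41-10.42) and element-wise clipping. The
function names are ours; most deep learning frameworks ship equivalent utilities
(for example, torch.nn.utils.clip_grad_norm_ in PyTorch).

    import numpy as np

    def clip_by_norm(g, v):
        """Norm clipping, Eq. 10.41-10.42: rescale g jointly if ||g|| > v.

        The clipped update keeps the exact direction of the original
        gradient; only its length is bounded by v.
        """
        norm = np.linalg.norm(g)
        return g * (v / norm) if norm > v else g

    def clip_elementwise(g, v):
        """Element-wise clipping: limit each component to [-v, v].

        The result is still a descent direction, but it is generally
        not aligned with the original gradient.
        """
        return np.clip(g, -v, v)

In a training loop, either function would be applied to the minibatch gradient
just before the parameter update.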
Gradient clipping helps to deal with exploding gradients, but it does not help with
vanishing gradients. To address vanishing gradients and better capture long-term
dependencies, we discussed the idea of creating paths in the computational graph of
the unfolded recurrent architecture along which the product of gradients associated
with arcs is near 1. One approach to achieve this is with LSTMs and other self-loops
and gating mechanisms, described above in Sec. 10.10. Another idea is to regularize
or constrain the parameters so as to encourage "information flow." In particular,
we would like the gradient vector ∇_{h^(t)} L being back-propagated to maintain its
magnitude, even if the loss function only penalizes the output at the end of the
sequence. Formally, we want

    (∇_{h^(t)} L) ∂h^(t)/∂h^(t−1)            (10.43)
to be as large as

    ∇_{h^(t)} L.                             (10.44)

With this objective, Pascanu et al. (2013a) propose the following regularizer:

    Ω = Σ_t ( ||(∇_{h^(t)} L) ∂h^(t)/∂h^(t−1)|| / ||∇_{h^(t)} L|| − 1 )²    (10.45)

Computing the gradient of this regularizer may appear difficult, but Pascanu
et al. (2013a) propose an approximation in which we consider the back-propagated
vectors ∇_{h^(t)} L as if they were constants (for the purpose of this regularizer, so
that there is no need to back-propagate through them). The experiments with
this regularizer suggest that, if combined with the norm clipping heuristic (which
handles gradient explosion), the regularizer can considerably increase the span of
the dependencies that an RNN can learn. Because it keeps the RNN dynamics
on the edge of explosive gradients, the gradient clipping is particularly important.
Without gradient clipping, gradient explosion prevents learning from succeeding.

A key weakness of this approach is that it is not as effective as the LSTM for
tasks where data is abundant, such as language modeling.
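Under the constant-vector approximation just described, the regularizer reduces
to a sum of scalar penalties that is easy to write down. The sketch below assumes
the training loop supplies, for each time step, the back-propagated vector and its
product with that step's Jacobian; both argument names are ours.

    import numpy as np

    def information_flow_penalty(grads_h, jvps):
        """Schematic evaluation of Eq. 10.45.

        grads_h[t] : the back-propagated vector grad_{h^(t)} L, treated
                     as a constant following Pascanu et al.'s approximation.
        jvps[t]    : that vector multiplied by the Jacobian dh^(t)/dh^(t-1),
                     i.e. the same error signal pushed one step further back.
        """
        omega = 0.0
        for g, gj in zip(grads_h, jvps):
            ratio = np.linalg.norm(gj) / (np.linalg.norm(g) + 1e-8)
            omega += (ratio - 1.0) ** 2  # penalize any change in magnitude
        return omega

The penalty is zero exactly when the error signal keeps its norm from one step to
the next, which is the "information flow" condition stated above.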
10.12 Explicit Memory

Intelligence requires knowledge, and acquiring knowledge can be done via learning,
which has motivated the development of large-scale deep architectures. However,
there are different kinds of knowledge. Some knowledge can be implicit, sub-
conscious, and difficult to verbalize—such as how to walk, or how a dog looks
different from a cat. Other knowledge can be explicit, declarative, and relatively
straightforward to put into words—everyday commonsense knowledge, like "a cat
is a kind of animal," or very specific facts that you need to know to accomplish
your current goals, like "the meeting with the sales team is at 3:00 PM in room
141."

Neural networks excel at storing implicit knowledge. However, they struggle
to memorize facts. Stochastic gradient descent requires many presentations of
the same input before it can be stored in neural network parameters, and even
then, that input will not be stored especially precisely. Graves et al. (2014b)
hypothesized that this is because neural networks lack the equivalent of the working
memory system that allows human beings to explicitly hold and manipulate pieces
of information that are relevant to achieving some goal. Such explicit memory
Figure 10.18: A schematic of an example of a network with an explicit memory, capturing
some of the key design elements of the neural Turing machine. In this diagram we
distinguish the "representation" part of the model (the "task network," here a recurrent
net in the bottom) from the "memory" part of the model (the set of cells), which can
store facts. The task network learns to "control" the memory, deciding where to read from
and where to write to within the memory (through the reading and writing mechanisms,
indicated by bold arrows pointing at the reading and writing addresses).
components would allow our systems not only to rapidly and "intentionally" store
and retrieve specific facts but also to sequentially reason with them. The need
for neural networks that can process information in a sequence of steps, changing
the way the input is fed into the network at each step, has long been recognized
as important for the ability to reason rather than to make automatic, intuitive
responses to the input (Hinton, 1990).

To resolve this difficulty, Weston et al. (2014) introduced memory networks that
include a set of memory cells that can be accessed via an addressing mechanism.
Memory networks originally required a supervision signal instructing them how
to use their memory cells. Graves et al. (2014b) introduced the neural Turing
machine, which is able to learn to read from and write arbitrary content to memory
cells without explicit supervision about which actions to undertake, and allowed
end-to-end training without this supervision signal, via the use of a content-based
soft attention mechanism (see Bahdanau et al. (2015) and Sec. 12.4.5.1). This
soft addressing mechanism has become standard with other related architectures
emulating algorithmic mechanisms in a way that still allows gradient-based opti-
mization (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Kumar et al., 2015;
Vinyals et al., 2015a; Grefenstette et al., 2015).

Each memory cell can be thought of as an extension of the memory cells in
LSTMs and GRUs. The difference is that the network outputs an internal state
that chooses which cell to read from or write to, just as memory accesses in a
digital computer read from or write to a specific address.

It is difficult to optimize functions that produce exact, integer addresses. To
alleviate this problem, NTMs actually read to or write from many memory cells
simultaneously. To read, they take a weighted average of many cells. To write, they
modify multiple cells by different amounts. The coefficients for these operations
are chosen to be focused on a small number of cells, for example, by producing
them via a softmax function. Using these weights with non-zero derivatives allows
the functions controlling access to the memory to be optimized using gradient
descent. The gradient on these coefficients indicates whether each of them should
be increased or decreased, but the gradient will typically be large only for those
memory addresses receiving a large coefficient.
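A minimal sketch of this soft read/write scheme follows, loosely modeled on the
NTM-style update. The controller network that produces the per-cell scores and
the erase/add vectors is assumed to exist and is not shown.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def soft_read(memory, scores):
        """Differentiable read: a weighted average over all cells.

        memory : (num_cells, cell_width) array of stored vectors
        scores : (num_cells,) logits emitted by the controller
        """
        w = softmax(scores)        # attention weights with non-zero derivatives
        return w @ memory          # (cell_width,) blended read vector

    def soft_write(memory, scores, erase, add):
        """Differentiable write: modify every cell by a different amount."""
        w = softmax(scores)[:, None]       # (num_cells, 1) write weights
        memory = memory * (1 - w * erase)  # partially erase attended cells
        return memory + w * add            # then add the new content

Because the softmax concentrates most of its mass on a few cells, the reads and
writes stay focused while remaining differentiable end to end.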
These memory cells are typically augmented to contain a vector, rather than
the single scalar stored by an LSTM or GRU memory cell. There are two reasons
to increase the size of the memory cell. One reason is that we have increased the
cost of accessing a memory cell. We pay the computational cost of producing a
coefficient for many cells, but we expect these coefficients to cluster around a small
number of cells. By reading a vector value, rather than a scalar value, we can
offset some of this cost. Another reason to use vector-valued memory cells is that
they allow for content-based addressing, where the weight used to read to or write
from a cell is a function of that cell. Vector-valued cells allow us to retrieve a
complete vector-valued memory if we are able to produce a pattern that matches
some but not all of its elements. This is analogous to the way that people can
recall the lyrics of a song based on a few words. We can think of a content-based
read instruction as saying, "Retrieve the lyrics of the song that has the chorus 'We
all live in a yellow submarine.'" Content-based addressing is more useful when we
make the objects to be retrieved large—if every letter of the song was stored in a
separate memory cell, we would not be able to find them this way. By comparison,
location-based addressing is not allowed to refer to the content of the memory. We
can think of a location-based read instruction as saying "Retrieve the lyrics of
the song in slot 347." Location-based addressing can often be a perfectly sensible
mechanism even when the memory cells are small.
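One common way to realize content-based addressing, used for example in the
neural Turing machine, is to score each cell by its cosine similarity to a query key
and normalize with a softmax. The sketch below is illustrative rather than any
particular published implementation; the sharpening parameter beta is our choice.

    import numpy as np

    def content_address(memory, key, beta=5.0):
        """Content-based addressing over a (num_cells, cell_width) memory.

        Each cell is weighted by its cosine similarity to `key`, so a
        partial pattern can retrieve a whole stored vector, much like
        recalling a song from a few words of its lyrics.
        """
        sims = memory @ key / (
            np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
        logits = beta * sims               # beta sharpens the focus
        e = np.exp(logits - logits.max())
        return e / e.sum()                 # attention weights over slots

    # Location-based addressing would instead emit weights over slot
    # indices directly, without ever looking at the cell contents.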
If the content of a memory cell is copied (not forgotten) at most time steps, then
the information it contains can be propagated forward in time and the gradients
propagated backward in time without either vanishing or exploding.
propagated backward in time without either vanishing or exploding.
a “task neural net network”
work” is coupled with a memory memory.. Although that task neural
net
netwowoThe
work explicit memory approac
rk could be feedforward or recurren h is
recurrent,illustrated
t, the ov in Fig.
overall
erall system10.18is ,awhere
recurren
recurrentwet see
netw that
network.ork.
a “task
The task netw neural
network net work” is coupled with
ork can choose to read from or write to sp a memory . Although
specific that task
ecific memory addresses. neural
network memory
Explicit could be seems feedforward
to allo
allowwormorecurren
dels tot,learn
models the ov erall that
tasks system is a recurren
ordinary RNNs tornetwLSTM ork.
The
RNNs task netwlearn.
cannot ork can One choose
reason toforreadthisfrom adv
advanortage
an write
antage mayto bspeecific
because memory addresses.
information and
Explicit
gradien
gradients memory seems to allo w mo dels
ts can be propagated (forward in time or backw to learn tasks that
backwards ordinary RNNs
ards in time, resp or LSTM
respectively)
ectively)
RNNs cannot
for very long durations.learn. One reason for this adv an tage may b e b ecause information and
gradients can be propagated (forward in time or backwards in time, respectively)
As an alternative to back-propagation through weighted averages of memory
cells, we can interpret the memory addressing coefficients as probabilities and
stochastically read just one cell (Zaremba and Sutskever, 2015). Optimizing models
that make discrete decisions requires specialized optimization algorithms, described
in Sec. 20.9.1. So far, training these stochastic architectures that make discrete
decisions remains harder than training deterministic algorithms that make soft
decisions.
Whether it is soft (allowing back-propagation) or stochastic and hard, the mech-
anism for choosing an address is in its form identical to the attention mechanism
which had been previously introduced in the context of machine translation (Bah-
danau et al., 2015) and discussed in Sec. 12.4.5.1. The idea of attention mechanisms
for neural networks was introduced even earlier, in the context of handwriting
generation (Graves, 2013), with an attention mechanism that was constrained to
move only forward in time through the sequence. In the case of machine translation
and memory networks, at each step, the focus of attention can move to a completely
different place, compared to the previous step.

Recurrent neural networks provide a way to extend deep learning to sequential
data. They are the last major tool in our deep learning toolbox. Our discussion now
moves to how to choose and use these tools and how to apply them to real-world
tasks.
Chapter 11

Practical Methodology
Successfully applying deep learning techniques requires more than just a good
knowledge of what algorithms exist and the principles that explain how they
work. A good machine learning practitioner also needs to know how to choose an
algorithm for a particular application and how to monitor and respond to feedback
obtained from experiments in order to improve a machine learning system. During
day to day development of machine learning systems, practitioners need to decide
whether to gather more data, increase or decrease model capacity, add or remove
regularizing features, improve the optimization of a model, improve approximate
inference in a model, or debug the software implementation of the model. All of
these operations are at the very least time-consuming to try out, so it is important
to be able to determine the right course of action rather than blindly guessing.

Most of this book is about different machine learning models, training algo-
rithms, and objective functions. This may give the impression that the most
important ingredient to being a machine learning expert is knowing a wide variety
of machine learning techniques and being good at different kinds of math. In prac-
tice, one can usually do much better with a correct application of a commonplace
algorithm than by sloppily applying an obscure algorithm. Correct application of
an algorithm depends on mastering some fairly simple methodology. Many of the
recommendations in this chapter are adapted from Ng (2015).

We recommend the following practical design process:

• Determine your goals—what error metric to use, and your target value for
this error metric. These goals and error metrics should be driven by the
problem that the application is intended to solve.

• Establish a working end-to-end pipeline as soon as possible, including the
estimation of the appropriate performance metrics.

• Instrument the system well to determine bottlenecks in performance. Diag-
nose which components are performing worse than expected and whether it
is due to overfitting, underfitting, or a defect in the data or software.

• Repeatedly make incremental changes such as gathering new data, adjusting
hyperparameters, or changing algorithms, based on specific findings from
your instrumentation.
As a running example, we will use the Street View address number transcription
system (Goodfellow et al., 2014d). The purpose of this application is to add
buildings to Google Maps. Street View cars photograph the buildings and record
the GPS coordinates associated with each photograph. A convolutional network
recognizes the address number in each photograph, allowing the Google Maps
database to add that address in the correct location. The story of how this
commercial application was developed gives an example of how to follow the design
methodology we advocate.

We now describe each of the steps in this process.
11.1 Performance Metrics

Determining your goals, in terms of which error metric to use, is a necessary first
step because your error metric will guide all of your future actions. You should
also have an idea of what level of performance you desire.

Keep in mind that for most applications, it is impossible to achieve absolute
zero error. The Bayes error defines the minimum error rate that you can hope to
achieve, even if you have infinite training data and can recover the true probability
distribution. This is because your input features may not contain complete
information about the output variable, or because the system might be intrinsically
stochastic. You will also be limited by having a finite amount of training data.

The amount of training data can be limited for a variety of reasons. When your
goal is to build the best possible real-world product or service, you can typically
collect more data but must determine the value of reducing error further and weigh
this against the cost of collecting more data. Data collection can require time,
money, or human suffering (for example, if your data collection process involves
performing invasive medical tests). When your goal is to answer a scientific question
about which algorithm performs better on a fixed benchmark, the benchmark
specification usually determines the training set and you are not allowed to collect
more data.

How can one determine a reasonable level of performance to expect? Typically,
in the academic setting, we have some estimate of the error rate that is attainable
based on previously published benchmark results. In the real-world setting, we
have some idea of the error rate that is necessary for an application to be safe,
cost-effective, or appealing to consumers. Once you have determined your realistic
desired error rate, your design decisions will be guided by reaching this error rate.

Another important consideration besides the target value of the performance
metric is the choice of which metric to use. Several different performance metrics
may be used to measure the effectiveness of a complete application that includes
machine learning components. These performance metrics are usually different
from the cost function used to train the model. As described in Sec. 5.1.2, it is
common to measure the accuracy, or equivalently, the error rate, of a system.

However, many applications require more advanced metrics.
Sometimes it is much more costly to make one kind of a mistake than another.
For example, an e-mail spam detection system can make two kinds of mistakes:
incorrectly classifying a legitimate message as spam, and incorrectly allowing a
spam message to appear in the inbox. It is much worse to block a legitimate
message than to allow a questionable message to pass through. Rather than
measuring the error rate of a spam classifier, we may wish to measure some form
of total cost, where the cost of blocking legitimate messages is higher than the cost
of allowing spam messages.
Sometimes we wish to train a binary classifier that is intended to detect some
rare event. For example, we might design a medical test for a rare disease. Suppose
that only one in every million people has this disease. We can easily achieve
99.9999% accuracy on the detection task, by simply hard-coding the classifier
to always report that the disease is absent. Clearly, accuracy is a poor way to
characterize the performance of such a system. One way to solve this problem is to
instead measure precision and recall. Precision is the fraction of detections reported
by the model that were correct, while recall is the fraction of true events that
were detected. A detector that says no one has the disease would achieve perfect
precision, but zero recall. A detector that says everyone has the disease would
achieve perfect recall, but precision equal to the percentage of people who have
the disease (0.0001% in our example of a disease that only one person in a million
has). When using precision and recall, it is common to plot a PR curve, with
precision on the y-axis and recall on the x-axis. The classifier generates a score
that is higher if the event to be detected occurred. For example, a feedforward
network designed to detect a disease outputs ŷ = P(y = 1 | x), estimating the
probability that a person whose medical results are described by features x has
the disease. We choose to report a detection whenever this score exceeds some
threshold. By varying the threshold, we can trade precision for recall. In many
cases, we wish to summarize the performance of the classifier with a single number
rather than a curve. To do so, we can convert precision p and recall r into an
F-score given by

    F = 2pr / (p + r).                       (11.1)
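A quick numerical check of these definitions, with illustrative counts of our own
choosing:

    def pr_f(true_pos, false_pos, false_neg):
        """Precision, recall, and the F-score of Eq. 11.1 from raw counts."""
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        f = 2 * precision * recall / (precision + recall)  # Eq. 11.1
        return precision, recall, f

    # 8 correct detections, 2 false alarms, 4 missed events:
    # precision = 0.8, recall ~ 0.667, F ~ 0.727
    print(pr_f(8, 2, 4))

Note that the always-absent detector from the rare-disease example reports no
detections at all, so its precision is undefined (0/0) while its recall is 0.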
Another option is to report the total area lying beneath the PR curve.

In some applications, it is possible for the machine learning system to refuse to
make a decision. This is useful when the machine learning algorithm can estimate
how confident it should be about a decision, especially if a wrong decision can
be harmful and if a human operator is able to occasionally take over. The Street
View transcription system provides an example of this situation. The task is to
transcribe the address number from a photograph in order to associate the location
where the photo was taken with the correct address in a map. Because the value
of the map degrades considerably if the map is inaccurate, it is important to add
an address only if the transcription is correct. If the machine learning system
thinks that it is less likely than a human being to obtain the correct transcription,
then the best course of action is to allow a human to transcribe the photo instead.
Of course, the machine learning system is only useful if it is able to dramatically
reduce the amount of photos that the human operators must process. A natural
performance metric to use in this situation is coverage. Coverage is the fraction of
examples for which the machine learning system is able to produce a response. It
is possible to trade coverage for accuracy. One can always obtain 100% accuracy
by refusing to process any example, but this reduces the coverage to 0%. For the
Street View task, the goal for the project was to reach human-level transcription
accuracy while maintaining 95% coverage. Human-level performance on this task
is 98% accuracy.
accuracy click-through
k-through
rates, collect user satisfaction surveys, and so on. Man Many y sp specialized
ecialized application
areasManhav
havey eother metrics are
application-sp
application-specific ecificpossible.
criteriaWase canwell.for example, measure click-through
rates, collect user satisfaction surveys, and so on. Many sp ecialized application
What is imp important
ortant is to determine whic which h performance metric to improv improvee ahead
areas have application-specific criteria as well.
of time, then concen concentrate
trate on improving this metric. Without clearly defined goals,
What
it can be isdifficult
important is towhether
to tell determine which to
changes performance
a machinemetric learningto improv
systeme aheadmake
of time, then
progress or not. concen trate on improving this metric. Without clearly defined goals,
it can be difficult to tell whether changes to a machine learning system make
progress or not.
11.2 Default Baseline Models

After choosing performance metrics and goals, the next step in any practical
application is to establish a reasonable end-to-end system as soon as possible. In
this section, we provide recommendations for which algorithms to use as the first
baseline approach in various situations. Keep in mind that deep learning research
progresses quickly, so better default algorithms are likely to become available soon
after this writing.

Depending on the complexity of your problem, you may even want to begin
without using deep learning. If your problem has a chance of being solved by
just choosing a few linear weights correctly, you may want to begin with a simple
statistical model like logistic regression.

If you know that your problem falls into an "AI-complete" category like object
recognition, speech recognition, machine translation, and so on, then you are likely
to do well by beginning with an appropriate deep learning model.

First, choose the general category of model based on the structure of your
data. If you want to perform supervised learning with fixed-size vectors as input,
use a feedforward network with fully connected layers. If the input has known
topological structure (for example, if the input is an image), use a convolutional
network. In these cases, you should begin by using some kind of piecewise linear
unit (ReLUs or their generalizations like Leaky ReLUs, PReLUs and maxout). If