
NIPS 2016

Residual Networks Behave Like Ensembles of Relatively Shallow Networks


Andreas Veit Michael Wilber Serge Belongie

Motivation
The traditional Computer Vision pipeline is feed-forward and hierarchical:
- Higher-level transformations depend only on the output of the previous step
- Low-level features lead to high-level features

Residual Networks challenge this traditional view:
1. Identity skip-connections bypass layers, allowing data to flow from any layer directly to any subsequent layer
2. Two orders of magnitude more transformations
3. Removing single layers from residual networks at test time does not noticeably affect their performance

Experiments

Deleting individual layers at test time
Giving neural networks brain damage helps us understand how they work!

Deleting and re-ordering multiple layers at test time
Transformations can be re-arranged at test time without much loss in performance.
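As a concrete illustration of what these test-time lesions do mechanically, here is a minimal NumPy sketch (a toy stand-in with random weights, not the authors' code or a trained network): deleting block i simply replaces y_i = y_{i-1} + f_i(y_{i-1}) with the identity y_i = y_{i-1}, and re-ordering just permutes the blocks.

import numpy as np

rng = np.random.default_rng(0)
DIM, N_BLOCKS = 16, 6

# Each residual branch f_i is a small random linear map + ReLU,
# a stand-in for the conv/batch-norm/ReLU blocks of a real ResNet.
weights = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(N_BLOCKS)]

def forward(x, order=None, deleted=()):
    """Run the residual stack; `order` re-orders blocks, `deleted` lesions them."""
    y = x
    for i in (range(N_BLOCKS) if order is None else order):
        if i in deleted:
            continue                                  # identity shortcut only
        y = y + np.maximum(weights[i] @ y, 0.0)       # y_i = y_{i-1} + f_i(y_{i-1})
    return y

x = rng.normal(size=DIM)
full     = forward(x)
lesioned = forward(x, deleted={2})                    # delete one block at test time
shuffled = forward(x, order=[0, 2, 1, 3, 5, 4])       # swap two blocks at test time
print(np.linalg.norm(full - lesioned), np.linalg.norm(full - shuffled))

In the poster's experiments this is done on a trained Residual Network and the effect on test error is measured; the sketch only shows the mechanics of the intervention.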



Identifying the effective path length

[Figure: number of paths per path length × gradient magnitude per path length]

The Paths through Residual Networks
In the traditional view, neural networks follow a strictly sequential order; a standard feed-forward CNN such as VGG or AlexNet is built this way. However, Residual Networks introduce identity shortcut connections that bypass individual blocks.

[Figure: original view vs. unraveled view of a Residual Network]

Removing individual layers affects all paths in traditional networks, but only half of the paths in residual networks.
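The "half of the paths" statement is a simple counting fact about the unraveled structure; the short sketch below spells it out (illustrative code with an arbitrary example size, not part of the poster).

from itertools import combinations

# A stack of n residual blocks unravels into one path per subset of blocks: 2**n paths.
# Every path either uses block i or skips it, so deleting block i removes exactly half
# of them. A plain feed-forward stack (VGG-style) has only the single full-length path.
n = 10                                            # number of residual blocks (example value)
paths = [set(c) for k in range(n + 1) for c in combinations(range(n), k)]

print(len(paths))                                 # 2**n = 1024 paths in total
print(len([p for p in paths if 2 not in p]))      # 512 paths survive deleting block 2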




The Unravelled View

The equation describing each building block of a Residual Network is

    y_i = y_{i-1} + f_i(y_{i-1})

We iteratively expand each step to achieve an equivalent unraveled view; for example, with three blocks:

    y_3 = y_2 + f_3(y_2)
        = [y_1 + f_2(y_1)] + f_3(y_1 + f_2(y_1))
        = [y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0))] + f_3(y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0)))

Shown graphically, this showcases the many paths through Residual Networks. In ordinary networks, deleting one layer deletes the only viable path. Deleting a single layer in a ResNet still leaves half of the available paths intact.
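A minimal numeric sketch (toy residual branches with random weights, not the trained network) confirming that the recursive update and its unraveled expansion are the same computation for three blocks:

import numpy as np

rng = np.random.default_rng(1)
W1, W2, W3 = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(3))
f1 = lambda y: np.maximum(W1 @ y, 0.0)   # residual branches (toy stand-ins)
f2 = lambda y: np.maximum(W2 @ y, 0.0)
f3 = lambda y: np.maximum(W3 @ y, 0.0)

y0 = rng.normal(size=8)

# Recursive form: y_i = y_{i-1} + f_i(y_{i-1})
y1 = y0 + f1(y0)
y2 = y1 + f2(y1)
y3 = y2 + f3(y2)

# Unraveled form: substitute y1 and y2 back into the last step.
y3_unraveled = (y0 + f1(y0) + f2(y0 + f1(y0))
                + f3(y0 + f1(y0) + f2(y0 + f1(y0))))

print(np.allclose(y3, y3_unraveled))      # True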

Gradient comes from short paths

Path length within a Residual Network follows a binomial distribution: there are few short paths and few long paths, while most paths are of medium length.

Gradient magnitude during training decreases exponentially with increasing path length, so long paths do not contribute any gradient.

Further, performance varies smoothly when deleting several paths or re-ordering several layers.

Surprisingly, although trained jointly, the paths show a high degree of independence.

Taken together, these observations imply that most gradient comes from short paths during training. Residual Networks sidestep the vanishing gradient problem by creating many short paths through the network.
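To make the "number of paths per length × gradient per path length" picture concrete, here is an illustrative calculation; the block count is just an example and the per-module gradient decay factor is an assumed toy value, not a quantity measured in the poster.

from math import comb

n = 54         # residual blocks (example value; a 110-layer ResNet has 54 blocks)
decay = 0.1    # assumed per-module gradient attenuation, purely for illustration

num_paths     = [comb(n, k) for k in range(n + 1)]     # Binomial(n, 0.5) path-length counts
grad_per_path = [decay ** k for k in range(n + 1)]     # toy exponential decay with length
effective     = [p * g for p, g in zip(num_paths, grad_per_path)]

mean_len = sum(k * p for k, p in enumerate(num_paths)) / sum(num_paths)
eff_len  = sum(k * e for k, e in enumerate(effective)) / sum(effective)
print(f"mean path length: {mean_len:.1f}, gradient-weighted mean: {eff_len:.1f}")

With this toy decay rate, the gradient-weighted average path is only a handful of blocks long even though the nominal mean is n/2, which is the qualitative effect described above.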

Takeaways
- Each module depends on the output of all previous modules, preventing co-adaptation between layers. This is similar to Dropout, which prevents co-adaptation by artificially changing the distribution within each batch.
- ResNets incur some of the benefits of ensemble-like systems.
- Skip connections make Residual Networks robust to certain kinds of corruption, even when the network has not been trained to compensate.
- Though Residual Networks are very deep, the average length of the data flow path is unexpectedly short: only a half-dozen layers deep for long networks.
- Overall: Residual Networks don't behave like traditional CNNs.
- This is only the start. There is still a long way to go toward understanding the information flow in neural networks and the implications of this study.

Andreas Veit is a 3rd-year PhD student working with Serge Belongie. His current research interests include Deep Learning, Computer Vision, Machine Learning and Human-in-the-Loop Computing. He is also interested in applications concerning sustainability.

Michael Wilber is a 4th-year PhD student working with Serge Belongie. His previous work includes using crowdsourcing to capture perceptual similarity metrics, for example to understand food taste or the intuitive notion of visual similarity between bird species.
