
NIPS 2016

Residual Networks Behave Like Ensembles of Relatively Shallow Networks


Andreas Veit Michael Wilber Serge Belongie

Motivation
The traditional Computer Vision pipeline is feed-forward and hierarchical:
- Higher-level transformations depend only on the output of the previous step
- Low-level features lead to high-level features

Residual Networks challenge this traditional view:
1. Identity skip-connections bypass layers, allowing data to flow from any layer directly to any subsequent layer
2. Two orders of magnitude more transformations
3. Removing single layers from residual networks at test time does not noticeably affect their performance

Experiments

Deleting individual layers at test time
Giving neural networks brain damage helps us understand how they work!

Deleting and re-ordering multiple layers at test time
Transformations can be re-arranged at test time without much loss in performance.
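As a concrete illustration of what these test-time lesions do mechanically, here is a minimal NumPy sketch (a toy stand-in with random weights, not the authors' code or a trained network): deleting block i simply replaces y_i = y_{i-1} + f_i(y_{i-1}) with the identity y_i = y_{i-1}, and re-ordering just permutes the blocks.

import numpy as np

rng = np.random.default_rng(0)
DIM, N_BLOCKS = 16, 6

# Each residual branch f_i is a small random linear map + ReLU,
# a stand-in for the conv/batch-norm/ReLU blocks of a real ResNet.
weights = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(N_BLOCKS)]

def forward(x, order=None, deleted=()):
    """Run the residual stack; `order` re-orders blocks, `deleted` lesions them."""
    y = x
    for i in (range(N_BLOCKS) if order is None else order):
        if i in deleted:
            continue                                  # identity shortcut only
        y = y + np.maximum(weights[i] @ y, 0.0)       # y_i = y_{i-1} + f_i(y_{i-1})
    return y

x = rng.normal(size=DIM)
full     = forward(x)
lesioned = forward(x, deleted={2})                    # delete one block at test time
shuffled = forward(x, order=[0, 2, 1, 3, 5, 4])       # swap two blocks at test time
print(np.linalg.norm(full - lesioned), np.linalg.norm(full - shuffled))

In the poster's experiments this is done on a trained Residual Network and the effect on test error is measured; the sketch only shows the mechanics of the intervention.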



Identifying the effective path length

[Figure: number of paths per path length × gradient magnitude per path length]

The Paths through Residual Networks
In the traditional view, neural networks follow a strictly sequential order; a standard feed-forward CNN such as VGG or AlexNet is built this way. However, Residual Networks introduce identity shortcut connections that bypass individual blocks.

[Figure: original view vs. unraveled view of a Residual Network]

Removing individual layers affects all paths in traditional networks, but only half of the paths in residual networks.
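The "half of the paths" statement is a simple counting fact about the unraveled structure; the short sketch below spells it out (illustrative code with an arbitrary example size, not part of the poster).

from itertools import combinations

# A stack of n residual blocks unravels into one path per subset of blocks: 2**n paths.
# Every path either uses block i or skips it, so deleting block i removes exactly half
# of them. A plain feed-forward stack (VGG-style) has only the single full-length path.
n = 10                                            # number of residual blocks (example value)
paths = [set(c) for k in range(n + 1) for c in combinations(range(n), k)]

print(len(paths))                                 # 2**n = 1024 paths in total
print(len([p for p in paths if 2 not in p]))      # 512 paths survive deleting block 2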




The Unravelled View

The equation describing each building block of a Residual Network is

    y_i = y_{i-1} + f_i(y_{i-1})

We iteratively expand each step to achieve an equivalent unraveled view; for example, with three blocks:

    y_3 = y_2 + f_3(y_2)
        = [y_1 + f_2(y_1)] + f_3(y_1 + f_2(y_1))
        = [y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0))] + f_3(y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0)))

Shown graphically, this showcases the many paths through Residual Networks. In ordinary networks, deleting one layer deletes the only viable path. Deleting a single layer in a ResNet still leaves half of the available paths intact.
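A minimal numeric sketch (toy residual branches with random weights, not the trained network) confirming that the recursive update and its unraveled expansion are the same computation for three blocks:

import numpy as np

rng = np.random.default_rng(1)
W1, W2, W3 = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(3))
f1 = lambda y: np.maximum(W1 @ y, 0.0)   # residual branches (toy stand-ins)
f2 = lambda y: np.maximum(W2 @ y, 0.0)
f3 = lambda y: np.maximum(W3 @ y, 0.0)

y0 = rng.normal(size=8)

# Recursive form: y_i = y_{i-1} + f_i(y_{i-1})
y1 = y0 + f1(y0)
y2 = y1 + f2(y1)
y3 = y2 + f3(y2)

# Unraveled form: substitute y1 and y2 back into the last step.
y3_unraveled = (y0 + f1(y0) + f2(y0 + f1(y0))
                + f3(y0 + f1(y0) + f2(y0 + f1(y0))))

print(np.allclose(y3, y3_unraveled))      # True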

Gradient comes from short paths

Path length within a Residual Network follows a binomial distribution: there are few short paths and few long paths, while most paths are of medium length.

Gradient magnitude during training decreases exponentially with increasing path length, so long paths do not contribute any gradient.

Further, performance varies smoothly when deleting several paths or re-ordering several layers.

Surprisingly, although trained jointly, the paths show a high degree of independence.

Taken together, these observations imply that most gradient comes from short paths during training. Residual Networks sidestep the vanishing gradient problem by creating many short paths through the network.
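To make the "number of paths per length × gradient per path length" picture concrete, here is an illustrative calculation; the block count is just an example and the per-module gradient decay factor is an assumed toy value, not a quantity measured in the poster.

from math import comb

n = 54         # residual blocks (example value; a 110-layer ResNet has 54 blocks)
decay = 0.1    # assumed per-module gradient attenuation, purely for illustration

num_paths     = [comb(n, k) for k in range(n + 1)]     # Binomial(n, 0.5) path-length counts
grad_per_path = [decay ** k for k in range(n + 1)]     # toy exponential decay with length
effective     = [p * g for p, g in zip(num_paths, grad_per_path)]

mean_len = sum(k * p for k, p in enumerate(num_paths)) / sum(num_paths)
eff_len  = sum(k * e for k, e in enumerate(effective)) / sum(effective)
print(f"mean path length: {mean_len:.1f}, gradient-weighted mean: {eff_len:.1f}")

With this toy decay rate, the gradient-weighted average path is only a handful of blocks long even though the nominal mean is n/2, which is the qualitative effect described above.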

Takeaways
- Each module depends on the output of all previous modules, preventing co-adaptation between layers. This is similar to Dropout, which prevents co-adaptation by artificially changing the distribution within each batch.
- ResNets incur some of the benefits of ensemble-like systems.
- Skip connections make Residual Networks robust to certain kinds of corruption, even when the network has not been trained to compensate.
- Though Residual Networks are very deep, the average length of the data flow path is unexpectedly short: only a half-dozen layers deep for long networks.
- Overall: Residual Networks don't behave like traditional CNNs.
- This is only the start. There is still a long way to go toward understanding the information flow in neural networks and the implications of this study.

Andreas Veit is a 3rd-year PhD student working with Serge Belongie. His current research interests include Deep Learning, Computer Vision, Machine Learning and Human-in-the-Loop Computing. He is also interested in applications concerning sustainability.

Michael Wilber is a 4th-year PhD student working with Serge Belongie. His previous work includes using crowdsourcing to capture perceptual similarity metrics, for example to understand food taste or the intuitive notion of visual similarity between bird species.
