
Unsupervised Domain Adaptation by Backpropagation
Yaroslav Ganin, Victor Lempitsky
July 8, 2015


Deep supervised neural networks

are a big thing in computer vision and beyond

demand lots of labeled data

Image credit: Krizhevsky et al.

Where to get data?


Lots of modalities do not have large labeled data sets:
Biomedical
Unusual cameras or image types
Videos
Data requiring expert-level annotation

Image credit: Staal et al.

Surrogate training data are often available:


Borrow from adjacent modality
Generate synthetic imagery (computer graphics)
Use data augmentation to amplify number of
training samples

Image credit: Xu et al.

Resulting training data have a different distribution.


We need domain adaptation
from the source domain to the target domain.



Example: the Office dataset (Saenko, 2010)

Source: office objects on white background

Target: photos of office objects taken by a webcam



Example: synthetic to real

Source: rendered numbers → Target: SVHN

Source: rendered road signs → Target: GTSRB


Assumptions and goals

We have:
Lots of labeled data in the source domain (e.g. synthetic images)
Lots of unlabeled data in the target domain (e.g. real images)
We want to train a neural network that does well on the target domain.

Large-scale deep unsupervised domain adaptation.




Domain shift in a deep architecture

Input x → Feature extractor: f = G_f(x; θ_f) → Label predictor: y = G_y(f; θ_y)

When trained on source only, the feature distributions do not match:

S(f) = { G_f(x; θ_f) | x ∼ S(x) }
T(f) = { G_f(x; θ_f) | x ∼ T(x) }

Figure (SYN NUMBERS → SVHN, last hidden layer of the label predictor): (a) Non-adapted vs. (b) Adapted feature distributions.

Our goal is to get the adapted case (b).
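
To make the notation concrete, here is a minimal numpy sketch of the (non-adapted) pipeline above. The linear form of G_f, the logistic form of G_y, the parameter shapes, and the random data are placeholders chosen purely for illustration, not the architectures used in the paper.

import numpy as np

# Placeholder parameterizations: G_f(x; theta_f) as a linear map,
# G_y(f; theta_y) as a logistic classifier on the extracted features.
def G_f(x, theta_f):
    return x @ theta_f.T

def G_y(f, theta_y):
    w, b = theta_y
    return 1.0 / (1.0 + np.exp(-(f @ w + b)))

rng = np.random.default_rng(0)
theta_f = rng.normal(size=(4, 3))        # feature extractor parameters
theta_y = (rng.normal(size=4), 0.0)      # label predictor parameters
x = rng.normal(size=(5, 3))              # a batch of inputs
y = G_y(G_f(x, theta_f), theta_y)        # label predictions, shape (5,)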

Our method: meet the domain classifier

Domain classifier:

Computes d = G_d(f; θ_d)
Is trained to predict 0 for source and 1 for target

Therefore, the domain loss is low when source and target features are well separated (non-adapted) and is higher when they overlap (adapted).

Figures: feature distributions for MNIST → MNIST-M (top feature extractor layer) and SYN NUMBERS → SVHN (last hidden layer of the label predictor), each shown (a) non-adapted and (b) adapted.
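
To see what the domain loss measures, here is a small self-contained numpy sketch (not from the deck): a logistic-regression stand-in for G_d(f; θ_d) is fit to separate two made-up 2-D feature clouds, and its final cross-entropy is low when the clouds are well separated and higher when they overlap. The data and hyperparameters are invented for illustration only.

import numpy as np

rng = np.random.default_rng(0)

def domain_loss(f_source, f_target, steps=500, lr=0.2):
    # Fit a logistic-regression domain classifier on fixed features
    # (0 = source, 1 = target) and return its final cross-entropy.
    f = np.vstack([f_source, f_target])
    d = np.concatenate([np.zeros(len(f_source)), np.ones(len(f_target))])
    w, b = np.zeros(f.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(f @ w + b)))
        g = (p - d) / len(d)              # d(cross-entropy)/d(logit)
        w -= lr * (f.T @ g)
        b -= lr * g.sum()
    p = np.clip(1.0 / (1.0 + np.exp(-(f @ w + b))), 1e-9, 1 - 1e-9)
    return -np.mean(d * np.log(p) + (1 - d) * np.log(1 - p))

f_src = rng.normal(0.0, 1.0, (500, 2))
print(domain_loss(f_src, rng.normal(4.0, 1.0, (500, 2))))   # separated clouds -> low loss
print(domain_loss(f_src, rng.normal(0.5, 1.0, (500, 2))))   # overlapping clouds -> higher loss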


How to train that thing?

Gradients flowing back during training: ∂L_y/∂θ_y and ∂L_y/∂θ_f from the label predictor ("Which label?"); ∂L_d/∂θ_d and ∂L_d/∂θ_f from the domain classifier ("Which domain?").

Let's try standard backpropagation. Emerging features are:

Discriminative (i.e. good for predicting y)
Domain-discriminative (i.e. good for predicting d)


How to train that thing?

Let's now inject the Gradient Reversal Layer (GRL) between the feature extractor and the domain classifier:

Copies data without change at fprop
Multiplies deltas by −λ at bprop

The feature extractor therefore receives ∂L_y/∂θ_f from the label predictor but −λ ∂L_d/∂θ_f from the domain classifier, while ∂L_y/∂θ_y and ∂L_d/∂θ_d are unchanged.

Emerging features are now:

Discriminative (i.e. good for predicting y)
Domain-invariant (i.e. not good for predicting d)

Few lines of code

import numpy as np

class GradientReversalLayer:
    def __init__(self, lambda_):
        self.lambda_ = lambda_

    def fprop(self, input_blob, output_blob):
        # Forward pass: copy the input activations unchanged.
        np.copyto(output_blob.data, input_blob.data)

    def bprop(self, input_blob, output_blob):
        # Backward pass: pass the gradient through, scaled by -lambda.
        np.multiply(output_blob.diff, -self.lambda_, out=input_blob.diff)
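
A minimal usage sketch for the layer above (not from the deck): the Blob class is a hypothetical stand-in for the framework's blob objects, just to show that fprop copies activations and bprop flips and scales the gradient.

import numpy as np

class Blob:
    # Hypothetical container: activations in .data, gradients in .diff.
    def __init__(self, shape):
        self.data = np.zeros(shape)
        self.diff = np.zeros(shape)

bottom, top = Blob(4), Blob(4)
bottom.data[:] = [1.0, -2.0, 3.0, 0.5]
top.diff[:] = [0.1, 0.1, -0.2, 0.4]     # gradient arriving from the domain classifier

grl = GradientReversalLayer(lambda_=1.0)
grl.fprop(bottom, top)                  # top.data is now a copy of bottom.data
grl.bprop(bottom, top)                  # bottom.diff == -1.0 * top.diff
print(top.data, bottom.diff)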


Saddle point interpretation

Our objective is

E(θ_f, θ_y, θ_d) = Σ_{i=1..N, d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1..N} L_d^i(θ_f, θ_d)

Backpropagation converges to a saddle point:

(θ̂_f, θ̂_y) = argmin over (θ_f, θ_y) of E(θ_f, θ_y, θ̂_d)
θ̂_d = argmax over θ_d of E(θ̂_f, θ̂_y, θ_d)

Similar idea in Generative Adversarial Networks (Goodfellow et al., 2014).
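
Concretely, the paper's stochastic updates seek this saddle point by reversing (and scaling by λ) only the domain gradient that reaches the feature extractor, with learning rate μ:

θ_f ← θ_f − μ (∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y
θ_d ← θ_d − μ ∂L_d^i/∂θ_d

The gradient reversal layer is what lets plain SGD/backpropagation produce the first update without any custom optimization code.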


The workflow overview

Train feature extractor + label predictor on source

Train feature extractor + domain classifier on source + target

Use feature extractor + label predictor at test time
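
As a toy, self-contained numpy sketch of these three steps (an illustration only, not the paper's implementation): a linear feature extractor, a logistic label predictor trained on labeled source data, and a logistic domain classifier trained on source + target, with the domain gradient reversed (scaled by −λ) by hand in the feature-extractor update. The 2-D Gaussian data, the linear/logistic forms, and all hyperparameters are made up.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n = 200
# Source: two labeled classes.  Target: same classes, shifted (labels unused for training).
xs = np.vstack([rng.normal((-1.0, 0.0), 0.5, (n, 2)), rng.normal((1.0, 0.0), 0.5, (n, 2))])
ys = np.concatenate([np.zeros(n), np.ones(n)])
xt = np.vstack([rng.normal((-1.0, 2.0), 0.5, (n, 2)), rng.normal((1.0, 2.0), 0.5, (n, 2))])
x_all = np.vstack([xs, xt])
d_all = np.concatenate([np.zeros(len(xs)), np.ones(len(xt))])

W = np.eye(2)                    # feature extractor:  f = x @ W.T
wy, by = np.zeros(2), 0.0        # label predictor:    p(y=1) = sigmoid(f . wy + by)
wd, bd = np.zeros(2), 0.0        # domain classifier:  p(d=1) = sigmoid(f . wd + bd)
lr, lam = 0.05, 1.0

for _ in range(1000):
    fs, f_all = xs @ W.T, x_all @ W.T

    # Label loss (source only): cross-entropy gradients.
    gy = (sigmoid(fs @ wy + by) - ys) / len(ys)          # dL_y/dlogit
    dwy, dby = fs.T @ gy, gy.sum()
    dW_y = np.outer(wy, gy @ xs)                         # dL_y/dW

    # Domain loss (source + target): cross-entropy gradients.
    gd = (sigmoid(f_all @ wd + bd) - d_all) / len(d_all)
    dwd, dbd = f_all.T @ gd, gd.sum()
    dW_d = np.outer(wd, gd @ x_all)                      # dL_d/dW

    # Updates: the domain gradient is reversed for the feature extractor only.
    W  -= lr * (dW_y - lam * dW_d)
    wy -= lr * dwy; by -= lr * dby
    wd -= lr * dwd; bd -= lr * dbd

# Test time: feature extractor + label predictor on target data.
yt = np.concatenate([np.zeros(n), np.ones(n)])           # held-out target labels, evaluation only
acc = np.mean((sigmoid((xt @ W.T) @ wy + by) > 0.5) == yt)
print(f"target accuracy: {acc:.2f}")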



Other approaches for deep DA

Recently, several deep approaches have been proposed:


Deep Domain Adaptation Network (DDAN) (Chen et al., 2015):
minimization of weighted Euclidean distance between features of
matching examples from both domains
Deep Domain Confusion (DDC) (Tzeng et al., 2014) and Deep
Adaptation Networks (DAN) (Long et al., 2015): minimization of
intra-batch maximum mean discrepancy (MMD) between source and
target features (a minimal MMD estimator is sketched after this list). Next talk!
Domain-adversarial neural networks (DANN) (Ajakan et al., 2014)
(concurrent effort): shallow version of our approach; joint paper
(Domain-Adversarial Training of Neural Networks) currently in
review
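
For intuition on the quantity DDC and DAN penalize, a minimal (biased) empirical MMD² estimator with an RBF kernel can be written as below; the function name, the single-kernel choice, and the bandwidth are illustrative assumptions, not those methods' exact multi-kernel setups.

import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    # Biased empirical MMD^2 between two feature batches X and Y
    # under an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
print(mmd2_rbf(rng.normal(0.0, 1.0, (256, 2)), rng.normal(0.0, 1.0, (256, 2))))  # small: same distribution
print(mmd2_rbf(rng.normal(0.0, 1.0, (256, 2)), rng.normal(1.5, 1.0, (256, 2))))  # larger: shifted batch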



Results: the Office dataset (Saenko, 2010)

Source: office objects on white background

Target: photos of office objects taken by a webcam


Results: the Office architecture

Feature extractor: beheaded AlexNet → fully-conn., 256 units, ReLU
Label predictor: fully-conn., 31 units, soft-max
Domain classifier: GRL → fully-conn., 1024 units, ReLU → fully-conn., 1024 units, ReLU → fully-conn., 1 unit, logistic
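
A schematic sketch of the branches listed above, written PyTorch-style purely for illustration (the deck's own snippet uses a Caffe-like blob interface); the 4096-dimensional input, the batch size, and the module names are assumptions.

import torch
import torch.nn as nn

# Label-predictor branch sits on top of the 256-unit bottleneck; the domain branch
# is attached to the same bottleneck through a GRL (see the numpy layer earlier).
bottleneck = nn.Sequential(nn.Linear(4096, 256), nn.ReLU())   # assumed 4096-d beheaded-AlexNet output
label_predictor = nn.Linear(256, 31)                          # 31 Office classes; soft-max lives in the loss
domain_classifier = nn.Sequential(
    nn.Linear(256, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1),                                       # 1 unit; logistic lives in the loss
)

x = torch.randn(8, 4096)              # stand-in features from the beheaded AlexNet
f = bottleneck(x)
class_scores = label_predictor(f)
domain_scores = domain_classifier(f)  # during training, f would first pass through the GRL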

Results: the Office dataset (cont.)

METHOD                                  AMAZON → WEBCAM   DSLR → WEBCAM   WEBCAM → DSLR
GFK (PLS, PCA) (Gong et al., 2013)      .197              .497            .631
SA (Fernando et al., 2013)              .450              .648            .699
DLID (S. Chopra & Gopalan, 2013)        .519              .782            .899
DDC (Tzeng et al., 2014)                .618              .950            .985
DAN (Long & Wang, 2015)                 .685              .960            .990
SOURCE ONLY                             .642              .961            .978
PROPOSED APPROACH                       .730              .964            .992

Protocol: all of the methods above use

all available labeled source samples
all available unlabeled target samples

Further experiments: baselines

Upper bound: training on the target domain with labels


Shallow DA baseline: Subspace Alignment (Fernando et al., 2013)
Lower bound: training on the source domain only
We use features extracted at the penultimate layer of the label predictor.


Further experiments: synthetic to real

Source: rendered numbers → Target: SVHN

Bar plot of accuracy on SVHN (approximate values from the chart): Lower bound (source only) ≈ 0.87, SA ≈ 0.86, Our approach ≈ 0.91, Upper bound (trained on target) ≈ 0.92.

Further experiments: larger gap

Source: SVHN → Target: MNIST

Bar plot of accuracy on MNIST (approximate values from the chart): Lower bound (source only) ≈ 0.55, SA ≈ 0.59, Our approach ≈ 0.74, Upper bound (trained on target) ≈ 0.99.

Conclusion

Scalable method for deep unsupervised domain adaptation:

Based on a simple idea; takes a few lines of code
State-of-the-art results
Relatively easy to tune (look at the domain classifier error)
Straightforward semi-supervised extension

Source code available at:

http://sites.skoltech.ru/compvision/projects/grl/
