FEW-SHOT LEARNING WITH META-LEARNING
Hugo Larochelle
Google Brain
A RESEARCH AGENDA
• Deep learning successes have required a lot of labeled training data
‣ collecting and labeling such data requires significant human labor
‣ practically, is that really how we’ll solve AI?
‣ scientifically, this means there is a gap with the ability of humans to learn, which we should try to understand
• Alternative solution: exploit other sources of data that are imperfect but plentiful
‣ unlabeled data (unsupervised learning)
‣ multimodal data (multimodal learning)
‣ multidomain data (transfer learning, domain adaptation)
People are good at it
Machines are getting better at it

• Human-level concept learning through probabilistic program induction (2015)
Brenden M. Lake, Ruslan Salakhutdinov and Joshua B. Tenenbaum

From the paper’s abstract: “People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms, for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world’s alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several ‘visual Turing tests’ probing the model’s creative generalization abilities, which in many cases are indistinguishable from human behavior.”

From the introduction: “Despite remarkable advances in artificial intelligence and machine learning, two aspects of human conceptual knowledge have eluded machine systems. First, for most interesting kinds of natural and man-made categories, people can learn a new concept from just one or a handful of examples, whereas standard algorithms in machine learning require tens or hundreds of examples to perform similarly. For instance, people may only need to see one example of a novel two-wheeled vehicle (Fig. 1A) in order to grasp the boundaries of the new concept…”
A RESEARCH AGENDA
• Let’s attack the problem of few-shot learning directly
‣ we want to design a learning algorithm A that outputs good parameters 𝜽 of a model M, when fed a small dataset Dtrain = {(xi, yi)}i=1…L
‣ …some have even reported positive transfer on medical imaging datasets!
• In few-shot learning, we aim at transferring the complete training of the model to new datasets (not just transferring the features or initialization)
‣ ideally there should be no human involved in producing a model for new datasets
‣ Object classification from a single example utilizing class relevance pseudo-metrics (2004)
Michael Fink
[Figure: an LSTM optimizer m unrolled through time; at each step t it receives the gradient ∇t and hidden state ht, and emits the update gt.]

• Learning to learn by gradient descent by gradient descent (2016)
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas

• Learning to learn using gradient descent (2001)
Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell

On the coordinatewise LSTM optimizer: “One challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of thousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it would require a huge hidden state and an enormous number of parameters.”
[Figure 2: A learning-rate training schedule for the weights in each layer (1–4) of a neural network, optimized by hypergradient descent; learning rate plotted against schedule index. The optimized schedule starts by taking large steps only in the topmost layer, then takes larger steps in the first layer.]

“In typical machine learning applications, only a few hyper-parameters (fewer than 20) are optimized. Since each experiment only yields a single number (the validation loss), hyper-parameter optimization becomes more difficult as the dimensionality of the hyper-parameter vector increases. In contrast, when hypergradients are available, the amount of information gained from each training run grows along with the number of hyper-parameters, allowing us to optimize thousands of hyper-parameters.”

• Gradient-based Hyperparameter Optimization through Reversible Learning (2015)
Dougal Maclaurin, David Duvenaud, and Ryan P. Adams
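As a toy worked example (my own, not from the paper), the hypergradient idea can be sketched by differentiating a validation loss through a single unrolled SGD step; the 1-D quadratic losses here are purely illustrative:

```python
import numpy as np

# Hypothetical 1-D losses: training loss (theta - 2)^2, validation loss (theta - 1)^2.
# One unrolled SGD step gives theta1 = theta0 - eta * g0, so by the chain rule the
# hypergradient of the validation loss w.r.t. the learning rate eta is
# dL_val/d_eta = -L_val'(theta1) * g0.

def train_grad(theta):
    return 2.0 * (theta - 2.0)      # gradient of the training loss

def val_grad(theta):
    return 2.0 * (theta - 1.0)      # gradient of the validation loss

def hypergradient(theta0, eta):
    g0 = train_grad(theta0)
    theta1 = theta0 - eta * g0      # one unrolled SGD step
    return -val_grad(theta1) * g0   # chain rule through the update
```

With many hyper-parameters, reverse-mode differentiation through the whole training run (made memory-feasible by the paper's reversible-learning trick) plays the role of this hand-derived chain rule.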
RELATED WORK: META-LEARNING

• AutoML (Bayesian optimization, reinforcement learning)

From the Neural Architecture Search paper: “In Neural Architecture Search, we use a controller to generate architectural hyperparameters of neural networks. To be flexible, the controller is implemented as a recurrent neural network. Suppose we would like to predict feedforward neural networks with only convolutional layers; we can use the controller to generate their hyperparameters as a sequence of tokens.”
META-LEARNING

• Learning algorithm A
‣ input: training set Dtrain = {(xi, yi)}
‣ output: parameters 𝜽 of model M (the learner)
‣ objective: good performance on test set Dtest = {(x’i, y’i)}

• Meta-learning algorithm
‣ input: meta-training set Dmeta-train = {(Dtrain(n), Dtest(n))}n=1…N of episodes
‣ output: parameters 𝝝 of algorithm A (the meta-learner)
‣ objective: good performance on meta-test set Dmeta-test = {(D’train(n), D’test(n))}n=1…N’

From Ravi & Larochelle (ICLR 2017), Task Description: “We first begin by detailing the meta-learning formulation we use. In the typical machine learning setting, we are interested in a dataset D and usually split D so that we optimize parameters 𝜽 on a training set Dtrain and evaluate its generalization on the test set Dtest. In meta-learning, however, we are dealing with meta-sets 𝒟 containing multiple regular datasets, where each D ∈ 𝒟 has a split of Dtrain and Dtest.

We consider the k-shot, N-class classification task, where for each dataset D, the training set consists of k labelled examples for each of N classes, meaning that Dtrain consists of k·N examples, and Dtest has a set number of examples for evaluation.

In meta-learning, we thus have different meta-sets for meta-training, meta-validation, and meta-testing (Dmeta-train, Dmeta-validation, and Dmeta-test, respectively). On Dmeta-train, we are interested in training a learning procedure (the meta-learner) that can take as input one of its training sets Dtrain and produce a model that achieves high average classification performance on its corresponding test set Dtest; it is important to have training conditions match those of test time. Using Dmeta-validation we can perform hyper-parameter selection of the meta-learning model, and we evaluate its generalization performance on Dmeta-test. A good meta-learner model will, given a series of learner gradients and losses on the training set Dtrain, suggest a series of updates for the learner model that train it towards good performance on the test set Dtest. For this formulation to correspond to the few-shot learning setting, each training set D ∈ 𝒟 will contain few labeled examples (we consider k = 1 or k = 5).”

[Figure 1 caption, excerpted here: “Computational graph for the forward pass of the meta-learner. The dashed line divides examples from the training set Dtrain and test set Dtest. Each (Xi, Yi) is the ith batch from the training set, whereas (X, Y) is all the elements from the test set. The dashed arrows indicate that we do not back-propagate through that step when training the meta-learner. We refer to the learner as M, where M(X; 𝜽) is the output of learner M using parameters 𝜽 for inputs X. We also use ∇t as a shorthand for ∇𝜽t−1 Lt.”]
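As a concrete illustration of the k-shot, N-class episode structure above (a sketch of mine, not code from the talk; the function name and data layout are hypothetical):

```python
import numpy as np

def sample_episode(data, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one k-shot, N-class episode (Dtrain, Dtest) from `data`,
    a dict mapping class label -> array of examples of that class."""
    rng = rng if rng is not None else np.random.default_rng()
    classes = rng.choice(np.array(list(data.keys())), size=n_way, replace=False)
    d_train, d_test = [], []
    for label, c in enumerate(classes):
        idx = rng.permutation(len(data[c]))
        # the first k examples of each sampled class form the training (support) set...
        d_train += [(data[c][i], label) for i in idx[:k_shot]]
        # ...and the next n_query examples form the test (query) set
        d_test += [(data[c][i], label) for i in idx[k_shot:k_shot + n_query]]
    return d_train, d_test
```

Meta-training then iterates over many such episodes, so that the conditions seen during meta-training match those at meta-test time.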
META-LEARNING
Figure 1: Example of meta-learning setup. The top represents the meta-training set Dmeta-train, where inside each gray box is a separate dataset that consists of the training set Dtrain (left side of the dashed line) and the test set Dtest (right side of the dashed line). In this illustration, we are considering …
META-LEARNING

[Figure, built up over several slides: each pair (Dtrain, Dtest) = one episode; the meta-learner (A) produces a model from Dtrain, and a loss is computed on Dtest.]
META-LEARNING NOMENCLATURE

• Episode = { Training set, Test set }
‣ the training set of an episode is also called the support set
‣ the test set of an episode is also called the query set
• Meta-training set
• Meta-test set

• Objective for an episode:

C(𝝝; Dtrain, Dtest) = (1 / |Dtest|) Σ(x’i, y’i)∈Dtest log p𝝝(y’i | x’i, Dtrain)
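The objective above can be computed generically given any predictive distribution conditioned on the support set; here is a minimal sketch of mine (the callable `log_prob` is a hypothetical stand-in for log p𝝝(y | x, Dtrain)):

```python
import numpy as np

def episode_objective(log_prob, d_train, d_test):
    """C(Theta; Dtrain, Dtest): average log-probability assigned to the
    query labels, where log_prob(y, x, d_train) implements
    log p_Theta(y | x, Dtrain) for some meta-learner."""
    return np.mean([log_prob(y, x, d_train) for (x, y) in d_test])

def meta_objective(log_prob, episodes):
    """Meta-training objective: the episodic objective averaged over
    a collection of (d_train, d_test) episodes."""
    return np.mean([episode_objective(log_prob, tr, te) for (tr, te) in episodes])
```

Maximizing `meta_objective` over the meta-learner's parameters 𝝝 is what "learning to learn" amounts to in this formulation.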
CHOOSING A META-LEARNER
• How to parametrize learning algorithms?
• Each class prototype is the mean of the embedded support examples:

ck = (1 / |Sk|) Σ(xi, yi)∈Sk f𝜙(xi),  where Sk = {(xi, yi) | yi = k, (xi, yi) ∈ Dtrain}   (1)

• Given a distance function d : R^M × R^M → [0, +∞), Prototypical Networks produce a distribution over classes for a query point x based on a softmax over distances to the prototypes in the embedding space:

p𝜙(y = k | x) = exp(−d(f𝜙(x), ck)) / Σk’ exp(−d(f𝜙(x), ck’))   (2)

• Training proceeds by minimizing the negative log-probability J(𝜙) = −log p𝜙(y = k | x) of the correct class. Training episodes are formed by randomly selecting a subset of classes from the training set, then choosing a subset of examples within each class to act as the support set, and a subset of the remainder as query points.

[Figure 1(a), few-shot case: prototypes ck are computed as the mean of the embedded support examples of each class.]

• Prototypical Networks for Few-shot Learning (2017)
Jake Snell, Kevin Swersky and Richard Zemel
PROTOTYPICAL NETWORKS

• Training a “prototype extractor”
‣ distance function d(·, ·) can be anything (squared Euclidean, negative cosine)
‣ if the distance is squared Euclidean, this is equivalent to learning an embedding network f(·) such that a Gaussian classifier works well
‣ prototype vectors are equivalent to the output weights of a neural network
‣ Snell et al. find that using more classes in the meta-training episodes than in meta-testing works better

[Figure 1(a), few-shot case: prototypes c1, c2, c3 are computed as the means of the embedded support examples; x is a query point.]

• Prototypical Networks for Few-shot Learning (2017)
Jake Snell, Kevin Swersky and Richard Zemel
META-LEARNER LSTM

• Training a “gradient descent procedure” applied on some learner M
‣ gradient descent starts from some initial parameters 𝜽0, then performs updates of the form
  𝜽t = 𝜽t−1 − αt ∇𝜽t−1 Lt
‣ this is quite similar to LSTM cell state updates:
  ct = ft ⊙ ct−1 + it ⊙ c̃t
  with ft = 1, ct−1 = 𝜽t−1, it = αt, and c̃t = −∇𝜽t−1 Lt

From Ravi & Larochelle (ICLR 2017), Model Description: “Consider a single dataset D ∈ Dmeta-train. Suppose we have a learner neural net model with parameters 𝜽 that we want to train on Dtrain. The standard optimization algorithms used to train deep neural networks are some variant of gradient descent, which uses updates of the form

𝜽t = 𝜽t−1 − αt ∇𝜽t−1 Lt,

where 𝜽t−1 are the parameters of the learner after t − 1 updates, αt is the learning rate at time t, Lt is the loss optimized by the learner for its t-th update, ∇𝜽t−1 Lt is the gradient of that loss with respect to parameters 𝜽t−1, and 𝜽t is the updated parameters of the learner.

Our key observation that we leverage here is that this update resembles the update for the cell state in an LSTM:

ct = ft ⊙ ct−1 + it ⊙ c̃t,

if ft = 1, ct−1 = 𝜽t−1, it = αt, and c̃t = −∇𝜽t−1 Lt.

Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural network. We set the cell state of the LSTM to be the parameters of the learner, or ct = 𝜽t, and the candidate cell state c̃t = −∇𝜽t−1 Lt, given how valuable information about the gradient is for optimization. We define parametric forms for it and ft so that the meta-learner can determine optimal values through the course of the updates.

Let us start with it, which corresponds to the learning rate for the updates. We let

it = σ(WI · [∇𝜽t−1 Lt, Lt, 𝜽t−1, it−1] + bI),

meaning that the learning rate is a function of the current parameter value 𝜽t−1, the current gradient ∇𝜽t−1 Lt, the current loss Lt, and the previous learning rate it−1. With this information, the meta-learner should be able to finely control the learning rate so as to train the learner quickly while avoiding divergence.

As for ft, it seems possible that the optimal choice isn’t the constant 1. Intuitively, what would justify shrinking the parameters of the learner and forgetting part of its previous value would be if the learner is currently in a bad local optimum and needs a large change to escape. This would correspond to a situation where the loss is high but the gradient is close to zero. Thus, one proposal for the forget gate is to have it be a function of that information, as well as the previous value of the forget gate.”
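A single step of this learned update rule can be sketched as follows (my simplification, not the paper's implementation: the forget gate is fixed to 1 here, and `W_I`, `b_I` stand in for the meta-learned input-gate parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_lstm_step(theta, grad, loss, i_prev, W_I, b_I):
    """One meta-learner update with f_t = 1, matching the gradient-descent
    analogy c_t = c_{t-1} + i_t * (-grad). W_I (shape (4,)) and b_I are
    hypothetical meta-learned gate parameters, shared across coordinates."""
    # input gate (= per-coordinate learning rate), a function of
    # [gradient, loss, current parameters, previous gate value]
    features = np.stack([grad, np.full_like(theta, loss), theta, i_prev], axis=1)
    i_t = sigmoid(features @ W_I + b_I)
    theta_next = theta + i_t * (-grad)   # candidate cell state c~_t = -grad
    return theta_next, i_t
```

In the full model the gates are produced by an LSTM rather than a single affine map, and the gate parameters are trained by back-propagating the query-set loss through the whole sequence of updates.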
META-LEARNER LSTM

• Training a “gradient descent procedure” applied on some learner M

[Figure 1: Computational graph for the forward pass of the meta-learner (LSTM), which trains M on Dtrain and is evaluated on Dtest. The dashed line divides examples from the training set Dtrain and test set Dtest; each (Xi, Yi) is the ith batch from the training set.]

• Optimization as a Model for Few-Shot Learning (2017)
Sachin Ravi and Hugo Larochelle
META-LEARNER LSTM

• Training a “gradient descent procedure” applied on some learner M
‣ LSTM parameters are shared across M’s parameters (i.e. treated like a large minibatch)
‣ can ignore (stop) gradients through the inputs of the LSTM
‣ gradient (and loss) inputs to the meta-learner LSTM are preprocessed as proposed by Andrychowicz et al. (2016)
‣ we are careful to avoid “leakage” from batchnorm statistics between meta-train / meta-test sets (sometimes referred to as the “transductive setting”)

From Andrychowicz et al. (2016), appendix A, Gradient preprocessing: “One potential challenge in training optimizers is that different input coordinates (i.e. the gradients w.r.t. different optimizee parameters) can have very different magnitudes. This is indeed the case e.g. when the optimizee is a neural network and different parameters correspond to weights in different layers. This can make training an optimizer difficult, because neural networks naturally disregard small variations in input signals and concentrate on bigger input values.

To this aim we propose to preprocess the optimizer’s inputs. One solution would be to give the optimizer (log(|∇|), sgn(∇)) as an input, where ∇ is the gradient in the current timestep. This has a problem that log(|∇|) diverges for ∇ → 0. Therefore, we use the following preprocessing formula

∇ → (log(|∇|) / p, sgn(∇))   if |∇| ≥ e^(−p)
    (−1, e^p ∇)              otherwise

where p > 0 is a parameter controlling how small gradients are disregarded (we use p = 10 in all our experiments).

We noticed that just rescaling all inputs by an appropriate constant instead also works fine, but the proposed preprocessing seems to be more robust and gives slightly better results on some problems.”
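The preprocessing formula translates directly into code (a sketch of mine implementing the formula as quoted, not the authors' released code):

```python
import numpy as np

def preprocess_gradient(g, p=10.0):
    """Andrychowicz et al. (2016) preprocessing: map each gradient coordinate
    to a two-channel representation (log-magnitude / p, sign) when |g| >= e^-p,
    and (-1, e^p * g) for tiny gradients, so that coordinates with very
    different magnitudes land on comparable scales."""
    g = np.asarray(g, dtype=float)
    big = np.abs(g) >= np.exp(-p)
    # np.maximum inside the log avoids evaluating log(0) on the masked-out branch
    first = np.where(big, np.log(np.maximum(np.abs(g), np.exp(-p))) / p, -1.0)
    second = np.where(big, np.sign(g), np.exp(p) * g)
    return np.stack([first, second], axis=-1)
```

Note the two branches agree at the threshold |g| = e^(−p), so the mapping is continuous.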
MODEL-AGNOSTIC META-LEARNING

• Training a “gradient descent procedure” applied on some learner M
‣ MAML proposes not to bother with training an LSTM for the gradient descent updates, and instead uses constant step-size updates
‣ better results are also reported with the so-called bias transformation architecture
(One-Shot Visual Imitation Learning via Meta-Learning, Finn et al. 2017)
- concatenates a trainable parameter vector to one of the layers, for instance to the input layer: [xi, 𝜽b]
- decouples the updates of the bias and weights of that layer
- with it, it can be shown that even a single gradient descent update yields a universal approximator over functions mapping Dtrain and x to any label y, for a sufficiently deep ReLU network and certain losses
(Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm, Finn and Levine, 2018)
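To make the constant-step-size idea concrete, here is a toy worked example of mine (not from the talk): for a scalar linear learner with squared error, the meta-gradient through one inner update can be written in closed form, including the second-order term that autodiff frameworks normally handle:

```python
import numpy as np

def inner_loss_grad(theta, x, y):
    # gradient of the squared-error loss L(theta) = mean((theta*x - y)^2)
    return np.mean(2.0 * x * (theta * x - y))

def maml_meta_grad(theta, x_tr, y_tr, x_te, y_te, alpha=0.05):
    """Meta-gradient of the test loss after one constant-step-size inner
    update theta' = theta - alpha * L_tr'(theta). For this scalar model
    L_tr'' is constant, so the chain rule through the update is explicit."""
    g_tr = inner_loss_grad(theta, x_tr, y_tr)
    theta_adapt = theta - alpha * g_tr              # inner (task) adaptation
    g_te = inner_loss_grad(theta_adapt, x_te, y_te)
    hess_tr = np.mean(2.0 * x_tr ** 2)              # L_tr''(theta)
    return g_te * (1.0 - alpha * hess_tr)           # d theta'/d theta factor
```

Dropping the `(1 - alpha * hess_tr)` factor recovers the first-order MAML approximation.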
CHOOSING A META-LEARNER
• How to parametrize learning algorithms?
BLACK-BOX META-LEARNER

• Frame meta-learning as sequence labeling with correct labels as delayed inputs

[Figure: the network outputs predictions yt, yt+1, …, yT across an episode; memory is reset between episodes.]

MEMORY-AUGMENTED NEURAL NETWORK

• Training a neural Turing machine to learn a learning algorithm
• Meta-Learning with Memory-Augmented Neural Networks (2016)
Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra and Timothy Lillicrap
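The delayed-label input scheme can be sketched as follows (my illustration of the setup, not the authors' code): at step t the network receives (x_t, y_{t−1}), so it must store the association in memory to exploit the label when the class reappears.

```python
import numpy as np

def delayed_label_sequence(episode, n_classes):
    """Build the input sequence for a black-box (memory-augmented)
    meta-learner: step t sees the current input x_t concatenated with the
    one-hot label of the PREVIOUS example. `episode` is a list of (x, y)
    pairs; a zero vector stands in for the missing label at the first step."""
    inputs = []
    prev = np.zeros(n_classes)              # null label before the first step
    for x, y in episode:
        inputs.append(np.concatenate([np.asarray(x, dtype=float), prev]))
        prev = np.eye(n_classes)[y]         # label revealed one step late
    return np.stack(inputs)
```

Shuffling the labels assigned to classes in every episode (together with resetting the memory) prevents the network from memorizing fixed class-label bindings in its weights.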
EXPERIMENT

• Mini-ImageNet (split used in Ravi & Larochelle, 2017)
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets Dtrain are generated by randomly picking 5 classes from the class subset

                                  5-class
Model                        1-shot           5-shot
Baseline-finetune            28.86 ± 0.54%    49.79 ± 0.79%
Baseline-nearest-neighbor    41.08 ± 0.70%    51.04 ± 0.65%
Matching Network             43.40 ± 0.78%    51.09 ± 0.71%
Matching Network FCE         43.56 ± 0.84%    55.31 ± 0.73%
Meta-Learner LSTM (OURS)     43.44 ± 0.77%    60.60 ± 0.71%
EXPERIMENT

• Mini-ImageNet (split used in Ravi & Larochelle, 2017)
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets Dtrain are generated by randomly picking 5 classes from the class subset

                                     5-class
Model                            1-shot           5-shot
Baseline-finetune                28.86 ± 0.54%    49.79 ± 0.79%
Baseline-nearest-neighbor        41.08 ± 0.70%    51.04 ± 0.65%
Matching Network                 43.40 ± 0.78%    51.09 ± 0.71%
Matching Network FCE             43.56 ± 0.84%    55.31 ± 0.73%
Meta-Learner LSTM (OURS)         43.44 ± 0.77%    60.60 ± 0.71%
Prototypical Nets (Snell et al.) 49.42 ± 0.78%    68.20 ± 0.66%
MAML (Finn et al.)               48.70 ± 1.84%    63.10 ± 0.92%
SNAIL (Mishra et al.)            55.71 ± 0.99%    68.88 ± 0.98%
• Semi-supervised learning (with distractors)
‣ assign soft labels to unlabeled examples

[Figure 2: Example of the semi-supervised few-shot learning setup. Training involves iterating through episodes, consisting of a support set S, an unlabeled set R, and a query set Q. The goal is to use the items (shown with their numeric class label) in S and the unlabeled items in R within each episode to generalize to the query set Q.]

• Refined prototypes (soft k-means with a soft mask on unlabeled examples):

p̃c = ( Σi h(xi) zi,c + Σj h(x̃j) z̃j,c mj,c ) / ( Σi zi,c + Σj z̃j,c mj,c )   (9)

where the soft mask mj,c is computed by an MLP incorporating various statistics of the normalized distances d̃j,c to the prototype; this allows each threshold to use information on the amount of intra-cluster variation to determine how aggressively it should cut out unlabeled examples.

“When training with this refinement process, the model can now use its MLP in Equation 8 to learn to include or ignore entirely certain unlabeled examples. The use of soft masks makes this process fully differentiable. Finally, much like for regular soft k-means (with or without a distractor cluster), while we could recursively repeat the refinement for multiple steps, we found a single step to perform well enough.”

• Meta-Learning for Semi-supervised Few-Shot Classification (2018)
Ren, Triantafillou, Ravi, Snell, Swersky, Tenenbaum, Larochelle and Zemel
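A single refinement step in the spirit of Equation (9) can be sketched as follows (my simplification: hard assignments z for labeled embeddings, softmax assignments z̃ for unlabeled ones, and an optional externally supplied mask instead of the paper's MLP-based one):

```python
import numpy as np

def refine_prototypes(protos, h_sup, y_sup, h_unl, mask=None):
    """One soft k-means refinement step: each prototype becomes a weighted
    mean of embedded labeled examples (hard assignments z) and embedded
    unlabeled examples (soft assignments z~), optionally down-weighted by a
    soft mask m to guard against distractors."""
    n_class = protos.shape[0]
    z = np.eye(n_class)[y_sup]                               # hard assignments
    d = ((h_unl[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d
    z_soft = np.exp(logits - logits.max(axis=1, keepdims=True))
    z_soft /= z_soft.sum(axis=1, keepdims=True)              # soft assignments
    m = np.ones_like(z_soft) if mask is None else mask
    num = z.T @ h_sup + (z_soft * m).T @ h_unl
    den = z.sum(axis=0) + (z_soft * m).sum(axis=0)
    return num / den[:, None]
```

Because every operation here is differentiable, the refinement can sit inside the episodic training loop and gradients flow back into the embedding network h.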
Table 2: Omniglot NLL in nats/pixel with four support examples. Attention Meta PixelCNN is a …
• Cold-start item recommendation
‣ given a user’s positive/negative item history, produce the parameters of an engagement predictor for new items

[Figure (a): Linear Classifier with Weight Adaptation. Changes in the shading of each connection to the output unit for two users illustrate that the weights of the classifier vary based on each user’s item history; the output bias, indicated by the shades of the circles, remains the same.]

[Figure (b): Non-linear Classifier with Bias Adaptation. Changes in the shading of each unit between two users illustrate that the biases of these units vary based on each user’s item history; the weights remain the same.]

• A Meta-Learning Perspective on Cold-Start Recommendations for Items (2017)
Manasi Vartak, Arvind Thiagarajan, Conrado Miranda, Jeshua Bratman, Hugo Larochelle
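The weight-adaptation idea in (a) can be sketched very simply (a hypothetical toy of mine, not the paper's adaptation network G: here the user-specific weights are just the mean positive-history embedding minus the mean negative-history embedding, while the bias stays shared):

```python
import numpy as np

def adapted_item_scores(pos_history, neg_history, candidates):
    """History-conditioned linear classifier: a user-specific weight vector
    is produced from the embeddings of the items the user engaged with
    (positively vs. negatively); candidate item embeddings are then scored.
    Arrays: pos_history (P, d), neg_history (N, d), candidates (C, d)."""
    w = pos_history.mean(axis=0) - neg_history.mean(axis=0)  # adapted weights
    b = 0.0                                                  # shared, non-adapted bias
    return candidates @ w + b
```

In the paper the adaptation function is itself learned by meta-training across many users, so that a brand-new item can be scored from a user's history alone, with no per-user retraining.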
DISCUSSION
• What is the right definition of distributions over problems?
‣ varying numbers of classes / examples per class?
‣ semantic differences between meta-training vs. meta-testing classes?
‣ overlap in meta-training vs. meta-testing classes? (recent “low-shot” literature)
THANK YOU!