
Practical Bayesian Optimization of Machine Learning Algorithms

Arturo Fernandez

CS 294
University of California, Berkeley

Tuesday, April 20, 2016







Motivation

Machine Learning Algorithms (MLAs) have hyperparameters that often need to be tuned:
- model hyperparameters (e.g. in Bayesian models)
- regularization parameters
- optimization procedure parameters
  - step size
  - minibatch size

Can we automate the optimization of these high-level parameters?

With some assumptions and Bayesian magic, yes!



Gaussian Process

Usually we observe inputs xi and outputs yi. For now, we assume yi = f(xi) (no noise) for some unknown function f.

Gaussian Processes (GPs) approach the prediction problem by inferring a distribution over functions given the data, p(f | X, y), and then making predictions as

    p(y∗ | x∗, X, y) = ∫ p(y∗ | f, x∗) · p(f | X, y) df

A Gaussian Process defines a prior over functions, which becomes a posterior over functions once we see data.



Gaussian Process

A Gaussian Process is defined so that for any n ∈ N,

    (f(x1), . . . , f(xn)) ∼ N(µ, K)

where µ ∈ Rⁿ and Ki,j = κ(xi, xj) for a positive definite kernel function κ.

Key Idea: If xi and xj are similar according to the kernel, then the outputs of the function at those points should be similar.



Gaussian Process

Let the prior on the regression function be a GP:

    f(x) ∼ GP( m(x), κ(x, x′) )

    m(x) = E[f(x)]
    κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))ᵀ]

are the mean and covariance/kernel functions, respectively.

For a finite set of points, this defines a joint Gaussian

    p(f | X) = N(f | µ, K)

where µ = (m(x1), . . . , m(xn)); usually m(x) = 0.
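As a concrete illustration (my own sketch, not from the original slides), the following numpy code draws sample functions from a zero-mean GP prior with a squared-exponential kernel; the evaluation grid, length scale ℓ, and signal variance σf² are arbitrary example choices.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel: sigma_f^2 * exp(-(x - x')^2 / (2 ell^2))."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sq_dists / ell ** 2)

# On a finite set of input points, the GP prior reduces to a joint Gaussian N(0, K).
x = np.linspace(-5, 5, 200)
K = se_kernel(x, x) + 1e-8 * np.eye(len(x))  # small jitter for numerical stability

# Draw a few sample functions from the prior.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)  # (3, 200): three functions evaluated on the grid
```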





GP Noise-free and Multivariate Gaussian Refresher

We see a training set D = {(xi, fi), i ∈ [N]} where fi = f(xi).

Given a test set X∗ of size N∗ × D, we want to predict the outputs f∗. By the definition of a GP,

    [f; f∗] ∼ N( [µ; µ∗], [K, K∗; K∗ᵀ, K∗∗] )

Thus

    f∗ ∼ p(f∗ | X∗, X, f) = N(µ̂, Σ̂)

    µ̂ = µ(X∗) + K∗ᵀ K⁻¹ (f − µ(X))
    Σ̂ = K∗∗ − K∗ᵀ K⁻¹ K∗




Priors and Kernels

Samples from a prior p(f | X), using the squared exponential / Gaussian / RBF kernel:

    κ(x, x′) = σf² exp( −(x − x′)² / (2ℓ²) )    (1D case)

ℓ controls the horizontal scale of variation, σf² controls the vertical variation.

Automatic Relevance Determination (ARD) squared exponential kernel:

    κ(x, x′) = θ0 exp( −½ (x − x′)ᵀ Diag(θ)⁻¹ (x − x′) ),    θ = [θ1² · · · θd²]
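A small sketch (my addition) of the ARD squared-exponential kernel above for D-dimensional inputs; θ0 and the per-dimension squared scales θd² are placeholder values.

```python
import numpy as np

def ard_se_kernel(X1, X2, theta0=1.0, theta=None):
    """ARD SE kernel: theta0 * exp(-0.5 * (x - x')^T Diag(theta)^{-1} (x - x')),
    where theta = [theta_1^2, ..., theta_D^2] holds one squared scale per dimension."""
    D = X1.shape[1]
    if theta is None:
        theta = np.ones(D)
    diff = X1[:, None, :] - X2[None, :, :]   # n1 x n2 x D
    sq = np.sum(diff ** 2 / theta, axis=-1)  # weighted squared distance
    return theta0 * np.exp(-0.5 * sq)

# Example: 2-D inputs where the second dimension is nearly irrelevant (large theta_2^2).
X = np.random.default_rng(1).uniform(0, 1, size=(5, 2))
K = ard_se_kernel(X, X, theta0=1.0, theta=np.array([0.1, 100.0]))
print(K.shape)  # (5, 5)
```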



Noisy Observations

In practice we observe y, where y = f(x) + ε and ε ∼ N(0, σy²); then

    Cov(y | X) = K + σy² I =: Ky

Assume E[f(x)] = 0 (so E[y] = 0 as well). Then, for a single test input,

    µ̂ = k∗ᵀ Ky⁻¹ y
    Σ̂ = k∗∗ − k∗ᵀ Ky⁻¹ k∗

where

    k∗ = [κ(x∗, x1), . . . , κ(x∗, xN)]  and  k∗∗ = κ(x∗, x∗)
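With noisy observations the only change to the earlier posterior sketch is replacing K by Ky = K + σy² I and conditioning on y. A minimal sketch (mine) of the single-test-point formulas; σy and the data are placeholders.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    return sigma_f ** 2 * np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ell ** 2)

def gp_predict_noisy(X, y, x_star, sigma_y=0.1):
    """mu_hat = k_*^T Ky^{-1} y,  var_hat = k_** - k_*^T Ky^{-1} k_*  (zero prior mean)."""
    K_y = se_kernel(X, X) + sigma_y ** 2 * np.eye(len(X))   # Ky = K + sigma_y^2 I
    k_star = se_kernel(X, np.array([x_star]))[:, 0]         # [kappa(x_*, x_1), ..., kappa(x_*, x_N)]
    k_ss = se_kernel(np.array([x_star]), np.array([x_star]))[0, 0]
    mu_hat = k_star @ np.linalg.solve(K_y, y)
    var_hat = k_ss - k_star @ np.linalg.solve(K_y, k_star)
    return mu_hat, var_hat

X = np.array([-2.0, 0.0, 1.5])
y = np.sin(X) + 0.1 * np.random.default_rng(2).standard_normal(3)
print(gp_predict_noisy(X, y, x_star=0.5))
```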





Bayesian Optimization with GP Priors
Setup for Bayesian Optimization (x is the vector of MLA hyperparameters):
1. x ∈ X ⊂ R^D and X bounded
2. f(x) is drawn from a GP prior
3. We want to minimize f(x) on X
4. Observations are of the form {xn, yn}, n = 1, . . . , N, with yn ∼ N(f(xn), ν)
5. An acquisition function (AF), a : X → R+, is used to pick the next point via xnext = arg maxx a(x) (a minimal loop sketch follows below)

- a(x) = a(x; {xn, yn}, θ) depends on the previous observations and the GP hyperparameters
- It depends on the model solely through
  - the predictive mean function µ(x; {xn, yn}, θ)
  - the predictive variance function σ²(x; {xn, yn}, θ)
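To make the setup concrete, here is a minimal Bayesian optimization loop of my own (not the paper's code): a GP with an SE kernel, expected improvement as the acquisition function, and maximization of the AF over a random candidate set. The toy objective, bounds, noise level, and kernel settings are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def se_kernel(X1, X2, ell=0.5, sigma_f=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, X_star, noise=1e-3):
    Ky = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(X, X_star)
    mu = Ks.T @ np.linalg.solve(Ky, y)
    var = np.diag(se_kernel(X_star, X_star)) - np.sum(Ks * np.linalg.solve(Ky, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    gamma = (f_best - mu) / sigma                    # we are minimizing f
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def objective(x):                                    # toy f; in practice: train the MLA, return validation error
    return np.sin(3 * x[0]) + 0.5 * x[0] ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))                  # initial design on the bounded set X
y = np.array([objective(x) for x in X])
for _ in range(10):
    cand = rng.uniform(-2, 2, size=(500, 1))         # candidate points in X
    mu, sigma = gp_posterior(X, y, cand)
    ei = expected_improvement(mu, sigma, y.min())
    x_next = cand[np.argmax(ei)]                     # x_next = argmax_x a(x)
    X = np.vstack([X, x_next]); y = np.append(y, objective(x_next))
print("best x:", X[np.argmin(y)], "best f:", y.min())
```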



What is f ?

- The framework is useful when evaluations of f are expensive.
- This is the case when each evaluation requires training a machine learning algorithm.
- Thus, we should be smart about where we evaluate next.







Acquisition Functions
Let φ, Φ be the pdf and cdf of a standard normal, and let xbest = arg min_{xn} f(xn).

1. Probability of Improvement.

    aPI(x; {xn, yn}, θ) = Φ(γ(x)) = P(N ≤ γ(x))

    γ(x) = [f(xbest) − µ(x; {xn, yn}, θ)] / σ(x; {xn, yn}, θ)

where N ∼ N(0, 1).

Points that have a high probability of being infinitesimally less than f(xbest) will be chosen over points that offer larger gains but less certainty.

2. Expected Improvement (over the current best) [BCd10]

    aEI(x; {xn, yn}, θ) = σ(x; {xn, yn}, θ) · [γ(x)Φ(γ(x)) + φ(γ(x))]
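A short sketch (mine, not the authors' code) of both acquisition functions, given the GP predictive mean µ and standard deviation σ at candidate points and the current best observed value; the numbers in the example are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def gamma(mu, sigma, f_best):
    # gamma(x) = (f(x_best) - mu(x)) / sigma(x)
    return (f_best - mu) / sigma

def prob_of_improvement(mu, sigma, f_best):
    # a_PI(x) = Phi(gamma(x))
    return norm.cdf(gamma(mu, sigma, f_best))

def expected_improvement(mu, sigma, f_best):
    # a_EI(x) = sigma(x) * (gamma(x) * Phi(gamma(x)) + phi(gamma(x)))
    g = gamma(mu, sigma, f_best)
    return sigma * (g * norm.cdf(g) + norm.pdf(g))

# Example: identical predictive means, different uncertainties.
mu = np.array([0.5, 0.5])
sigma = np.array([0.05, 0.5])
print(prob_of_improvement(mu, sigma, f_best=0.52))   # PI favors the low-variance point
print(expected_improvement(mu, sigma, f_best=0.52))  # EI favors the high-variance point
```

This small example reproduces the criticism above: PI prefers the near-certain, tiny improvement, while EI trades it off against the larger potential gain.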

Covariance Function and its Hyperparameters

The ARD SE kernel is too smooth; instead use the ARD Matérn 5/2 kernel:

    K_M52(x, x′) = θ0 ( 1 + √(5 r²(x, x′)) + (5/3) r²(x, x′) ) exp( −√(5 r²(x, x′)) )

    r²(x, x′) = Σ_{d=1}^{D} (x_d − x′_d)² / θ_d²

This kernel yields sample functions that are twice differentiable.

D + 3 hyperparameters:
- D length scales θ_{1:D}
- amplitude θ0
- observation noise ν
- constant mean m
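A sketch (my own, with placeholder hyperparameters) of the ARD Matérn 5/2 kernel as written above:

```python
import numpy as np

def ard_matern52(X1, X2, theta0=1.0, theta=None):
    """K_M52(x, x') = theta0 * (1 + sqrt(5 r^2) + (5/3) r^2) * exp(-sqrt(5 r^2)),
    with r^2(x, x') = sum_d (x_d - x'_d)^2 / theta_d^2."""
    D = X1.shape[1]
    if theta is None:
        theta = np.ones(D)                           # per-dimension length scales theta_{1:D}
    diff = X1[:, None, :] - X2[None, :, :]
    r2 = np.sum(diff ** 2 / theta ** 2, axis=-1)
    sqrt5r = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + sqrt5r + (5.0 / 3.0) * r2) * np.exp(-sqrt5r)

X = np.random.default_rng(3).uniform(size=(4, 3))    # D = 3 hyperparameter dimensions
print(ard_matern52(X, X).shape)                      # (4, 4)
```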



Integrated Acquisition Function

To be fully Bayesian, we should marginalize over the hyperparameters (denoted θ) by computing the Integrated Acquisition Function (IAF):

    â(x; {xn, yn}) = ∫ a(x; {xn, yn}, θ) · p(θ | {xn, yn}) dθ

- This expectation is a good generalization that accounts for the uncertainty in the chosen hyperparameters
- We can blend the a(·) functions arising from samples of the posterior over GP hyperparameters, and then use a Monte Carlo estimate of the Integrated Expected Improvement (IEI)
- To draw these samples, use slice sampling [MP10]
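A hedged sketch of the Monte Carlo estimate of the IAF: average the acquisition surfaces computed under different GP hyperparameter samples. Here the samples θ^(s) are assumed to come from a slice sampler over the hyperparameter posterior (not shown), and the acquisition callable is a stand-in for the EI code above.

```python
import numpy as np

def integrated_acquisition(candidates, theta_samples, acquisition_under_theta):
    """MC estimate: a_hat(x) ~= (1/S) * sum_s a(x; {x_n, y_n}, theta^(s)).

    acquisition_under_theta(candidates, theta) must return the acquisition values
    (e.g. EI) at `candidates` for a GP fit with hyperparameters `theta`."""
    values = [acquisition_under_theta(candidates, theta) for theta in theta_samples]
    return np.mean(values, axis=0)

# Usage sketch with placeholders (a real version wraps the GP + EI code above):
cand = np.linspace(-2, 2, 100)[:, None]
theta_samples = [{"ell": 0.3}, {"ell": 0.6}, {"ell": 1.0}]   # e.g. slice-sampling draws
fake_ei = lambda X, th: np.exp(-X[:, 0] ** 2 / th["ell"])    # stand-in for EI under theta
iei = integrated_acquisition(cand, theta_samples, fake_ei)
x_next = cand[np.argmax(iei)]
```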



Costs

- We don't just care about minimizing f
- Evaluating f can result in vastly different execution times depending on the MLA hyperparameters
- The authors propose optimizing expected improvement per second
- We don't know the true f, and we also don't know the duration function c(x) : X → R+
- Solution: model ln c(x) alongside f; assuming independence between them makes the computation easier
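A minimal sketch of the "EI per second" idea: fit a second, independent GP to the log durations ln c(x) and divide the expected improvement by the predicted cost. The EI values and predicted log costs below are placeholder inputs; in practice they come from the two GPs.

```python
import numpy as np

def ei_per_second(ei_values, log_cost_mu):
    """Expected improvement per second: divide EI by the predicted duration.

    ei_values   : EI at candidate points from the objective GP.
    log_cost_mu : posterior mean of a second, independent GP fit to ln c(x),
                  so exp(log_cost_mu) is the predicted duration in seconds.
    """
    predicted_seconds = np.exp(log_cost_mu)
    return ei_values / predicted_seconds

# Usage: a cheap point with modest EI can beat an expensive point with higher EI.
ei = np.array([0.10, 0.15])
log_cost = np.log(np.array([5.0, 60.0]))   # predicted 5 s vs 60 s per evaluation
print(ei_per_second(ei, log_cost))         # [0.02, 0.0025] -> pick the cheap point
```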



Parallelization Scheme

Use batch parallelism plus a sequential strategy over yet-to-be-evaluated points by computing MC estimates of the AF over different possible realizations of the pending y's.

- N evaluations have completed: {xn, yn}, n = 1, . . . , N
- J evaluations are pending at locations {x̄j}, j = 1, . . . , J
- Choose the new point based on the expected AF under all possible outcomes of the pending evaluations:

    â(x; {xn, yn}, θ, {x̄j}) = ∫_{R^J} a(x; {xn, yn}, θ, {x̄j, yj}) · p({yj} | {x̄j}, {xn, yn}) dy1 · · · dyJ
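A sketch (mine) of the Monte Carlo estimate over pending evaluations: fantasize outcomes yj for the pending points x̄j, compute the acquisition function as if those outcomes were observed, and average. The `sample_pending` and `acquisition` callables are assumptions standing in for the GP posterior sampler and EI sketches above; only the fantasizing logic is shown.

```python
import numpy as np

def parallel_acquisition(candidates, X_obs, y_obs, X_pending,
                         sample_pending, acquisition, n_fantasy=16, seed=0):
    """MC estimate of the acquisition under possible outcomes of pending jobs:

        a_hat(x) ~= (1/S) * sum_s a(x; {x_n, y_n} U {x_bar_j, y_j^(s)}, theta)

    sample_pending(X_pending, X_obs, y_obs, rng) draws one fantasy vector of
    outcomes for the pending points (ideally from the current GP posterior);
    acquisition(candidates, X, y) returns acquisition values given a data set."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(len(candidates))
    for _ in range(n_fantasy):
        y_fantasy = sample_pending(X_pending, X_obs, y_obs, rng)
        X_aug = np.vstack([X_obs, X_pending])
        y_aug = np.concatenate([y_obs, y_fantasy])
        acc += acquisition(candidates, X_aug, y_aug)
    return acc / n_fantasy

# Usage with toy stand-ins (a real version samples fantasies from the GP posterior and uses EI):
X_obs, y_obs = np.array([[0.0], [1.0]]), np.array([1.0, 0.5])
X_pending = np.array([[0.5]])
cand = np.linspace(-1, 2, 50)[:, None]
sample_pending = lambda Xp, X, y, rng: y.mean() + 0.1 * rng.standard_normal(len(Xp))
acquisition = lambda C, X, y: -np.abs(C[:, 0] - X[np.argmin(y), 0])   # toy stand-in for EI
print(parallel_acquisition(cand, X_obs, y_obs, X_pending, sample_pending, acquisition).shape)
```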



Methods and Metrics

- Expected improvement with GP hyperparameter marginalization, denoted GP EI MCMC
- Optimizing the GP hyperparameters, denoted GP EI Opt
- EI per second, denoted GP EI per Second
- N-times parallelized GP EI MCMC, denoted Nx GP EI MCMC



Online LDA

Hyperparameters:
- Learning rate ρt = (τ0 + t)^(−κ), i.e. the pair (τ0, κ)
- Minibatch size
The cited paper uses an exhaustive grid search of size 6 × 6 × 8 (288 configurations).



Results



3-layer CNN

Hyperparameters:
- Number of epochs to run the model
- Learning rate
- Four weight costs (one for each layer and one for the softmax output weights)
- Width, scale, and power of the response normalization on the pooling layers
The cited paper uses an exhaustive search of size 6 × 6 × 8 (288 configurations).



Results



References

[BCd10] E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. ArXiv e-prints, December 2010.

[MP10] I. Murray and R. P. Adams. Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. ArXiv e-prints, June 2010.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. ArXiv e-prints, June 2012.

