Practical Bayesian Optimization of Machine Learning Algorithms

Arturo Fernandez

CS 294
University of California, Berkeley

Tuesday, April 20, 2016

Machine Learning Algorithms (MLAs) have hyperparameters that often need to be tuned:
- model hyperparameters (e.g. Bayesian models)
- regularization parameters
- optimization procedure parameters
  - step size
  - minibatch size

Can we automate the optimization of these high-level parameters?


With some assumptions and Bayesian magic, yes!

Gaussian Process

Usually we observe inputs $\mathbf{x}_i$ and outputs $y_i$. For now, we assume $y_i = f(\mathbf{x}_i)$ (no noise) for some unknown function $f$.

Gaussian Processes (GPs) approach the prediction problem by inferring a distribution over functions given the data, $p(f \mid \mathbf{X}, \mathbf{y})$, and then making predictions as

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \int p(y_* \mid f, \mathbf{x}_*) \, p(f \mid \mathbf{X}, \mathbf{y}) \, df$$

A Gaussian Process defines a prior over functions, which becomes a posterior over functions once we see data.

Gaussian Process

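A Gaussian Process is defined so that for any $n \in \mathbb{N}$ and any inputs $\mathbf{x}_1, \dots, \mathbf{x}_n$

$$f(\mathbf{x}_1), \dots, f(\mathbf{x}_n) \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$

where $\boldsymbol{\mu} \in \mathbb{R}^n$ and $K_{i,j} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ for a positive definite kernel function $\kappa$.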

Key Idea: If $\mathbf{x}_i$ and $\mathbf{x}_j$ are similar according to the kernel, then the outputs of the function at those points should be similar.

Gaussian Process

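Let the prior on the regression function be a GP

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), \kappa(\mathbf{x}, \mathbf{x}'))$$

where

$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$
$$\kappa(\mathbf{x}, \mathbf{x}') = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))^T]$$

are the mean and covariance/kernel function respectively.

For a finite set of points, this defines a joint Gaussian

$$p(\mathbf{f} \mid \mathbf{X}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K})$$

where $\boldsymbol{\mu} = (m(\mathbf{x}_1), \dots, m(\mathbf{x}_n))$; usually $m(\mathbf{x}) = 0$.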

GP Noise-free and Multivariate Gaussian Refresher

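We see a training set $\mathcal{D} = \{(\mathbf{x}_i, f_i), i \in [N]\}$ where $f_i = f(\mathbf{x}_i)$.

Given a test set $\mathbf{X}_*$ of size $N_* \times D$, we want to predict the outputs $\mathbf{f}_*$. By definition of the GP,

$$\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{pmatrix}, \begin{pmatrix} \mathbf{K} & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**} \end{pmatrix} \right)$$

with $\mathbf{K} = \kappa(\mathbf{X}, \mathbf{X})$, $\mathbf{K}_* = \kappa(\mathbf{X}, \mathbf{X}_*)$ and $\mathbf{K}_{**} = \kappa(\mathbf{X}_*, \mathbf{X}_*)$. Conditioning this joint Gaussian on the observed $\mathbf{f}$ gives

$$\mathbf{f}_* \sim p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{f}) = \mathcal{N}(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})$$

$$\hat{\boldsymbol{\mu}} = \boldsymbol{\mu}(\mathbf{X}_*) + \mathbf{K}_*^T \mathbf{K}^{-1} (\mathbf{f} - \boldsymbol{\mu}(\mathbf{X}))$$
$$\hat{\boldsymbol{\Sigma}} = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}^{-1} \mathbf{K}_*$$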
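
As a rough illustration of the conditioning formulas above, the following NumPy sketch computes the posterior mean and covariance for a zero-mean GP. The squared exponential kernel, the `jitter` term, the toy data and all function names are assumptions made for this example; they are not taken from the slides.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=1.0):
    """Squared exponential kernel matrix between rows of A (n, d) and B (m, d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / ell**2)

def gp_posterior_noise_free(X, f, X_star, kernel=se_kernel, jitter=1e-10):
    """Posterior mean and covariance of f_* for a zero-mean GP, noise-free data."""
    K = kernel(X, X) + jitter * np.eye(len(X))  # jitter is a numerical aid only
    K_s = kernel(X, X_star)                     # K_*  with shape (N, N_*)
    K_ss = kernel(X_star, X_star)               # K_** with shape (N_*, N_*)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # K^{-1} f
    V = np.linalg.solve(L, K_s)                          # L^{-1} K_*
    mu_hat = K_s.T @ alpha
    Sigma_hat = K_ss - V.T @ V                           # K_** - K_*^T K^{-1} K_*
    return mu_hat, Sigma_hat

# Toy data (assumed): three noise-free observations of sin(x).
X = np.array([[-1.0], [0.0], [1.0]])
f = np.sin(X).ravel()
X_star = np.array([[0.0], [0.5]])   # one training input, one new input
mu_hat, Sigma_hat = gp_posterior_noise_free(X, f, X_star)
```

At a training input the posterior mean reproduces the observed value and the variance collapses to numerically zero, which is the interpolation behavior expected without noise; at the new input the variance stays positive.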

Priors and Kernels

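Samples from a prior $p(\mathbf{f} \mid \mathbf{X})$, using the squared exponential / Gaussian / RBF kernel (1D case):

$$\kappa(x, x') = \sigma_f^2 \exp\left( -\frac{1}{2\ell^2} (x - x')^2 \right)$$

$\ell$ controls the horizontal scale of variation, $\sigma_f^2$ controls the vertical scale.

Automatic Relevance Determination (ARD) squared exponential kernel:

$$\kappa(\mathbf{x}, \mathbf{x}') = \theta_0 \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{x}')^T \mathrm{Diag}(\boldsymbol{\theta})^{-1} (\mathbf{x} - \mathbf{x}') \right)$$
$$\boldsymbol{\theta} = [\theta_1^2, \dots, \theta_d^2]$$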
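
To make the role of the length scales concrete, here is a minimal sketch of the ARD squared exponential kernel and of drawing samples from the corresponding zero-mean GP prior. The function names, the `jitter` value, the grid and the two length scales are illustrative assumptions.

```python
import numpy as np

def ard_se_kernel(A, B, theta0=1.0, lengthscales=None):
    """ARD squared exponential kernel; lengthscales holds theta_1..theta_d (unsquared)."""
    d = A.shape[1]
    ls = np.ones(d) if lengthscales is None else np.asarray(lengthscales, dtype=float)
    diff = (A[:, None, :] - B[None, :, :]) / ls   # scale each dimension separately
    return theta0 * np.exp(-0.5 * (diff ** 2).sum(axis=-1))

def sample_gp_prior(X, kernel, n_samples=3, jitter=1e-6, seed=0):
    """Draw zero-mean GP prior samples at the rows of X using a Cholesky factor."""
    rng = np.random.default_rng(seed)
    K = kernel(X, X) + jitter * np.eye(len(X))    # jitter keeps K positive definite
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(X), n_samples))
    return (L @ z).T                              # shape (n_samples, n_points)

X = np.linspace(-5.0, 5.0, 40)[:, None]
# A long length scale gives slowly varying samples, a short one gives wiggly samples.
smooth = sample_gp_prior(X, lambda A, B: ard_se_kernel(A, B, lengthscales=[2.0]))
wiggly = sample_gp_prior(X, lambda A, B: ard_se_kernel(A, B, lengthscales=[0.3]))
```

In one dimension with `theta0` equal to $\sigma_f^2$ and a single length scale $\ell$, this reduces to the 1D squared exponential kernel above.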

Noisy Observations

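We actually observe $y = f(\mathbf{x}) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_y^2)$, so that

$$\mathrm{Cov}(\mathbf{y} \mid \mathbf{X}) = \mathbf{K} + \sigma_y^2 \mathbf{I} =: \mathbf{K}_y$$

Assume $\mathbb{E}[f(\mathbf{x})] = 0$ (and therefore $\mathbb{E}[y] = 0$). Then, in the case of a single test point $\mathbf{x}_*$,

$$\hat{\mu} = \mathbf{k}_*^T \mathbf{K}_y^{-1} \mathbf{y}$$
$$\hat{\Sigma} = k_{**} - \mathbf{k}_*^T \mathbf{K}_y^{-1} \mathbf{k}_*$$

where $\mathbf{k}_* = [\kappa(\mathbf{x}_*, \mathbf{x}_1), \dots, \kappa(\mathbf{x}_*, \mathbf{x}_N)]$ and $k_{**} = \kappa(\mathbf{x}_*, \mathbf{x}_*)$.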
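
The single-test-point formulas can be sketched as follows. This is a minimal example under stated assumptions: a unit-amplitude squared exponential kernel, a known noise level `sigma_y`, and hypothetical toy data.

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    """Unit-amplitude squared exponential kernel between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / ell**2)

def gp_predict_noisy(X, y, x_star, sigma_y=0.1, kernel=se_kernel):
    """Predictive mean and variance of f(x_star) for a zero-mean GP with noisy y."""
    K_y = kernel(X, X) + sigma_y**2 * np.eye(len(X))        # K + sigma_y^2 I
    k_s = kernel(X, x_star[None, :]).ravel()                # k_*
    k_ss = kernel(x_star[None, :], x_star[None, :])[0, 0]   # k_**
    mu_hat = k_s @ np.linalg.solve(K_y, y)
    var_hat = k_ss - k_s @ np.linalg.solve(K_y, k_s)
    return mu_hat, var_hat

# Toy data (assumed): four noisy observations.
X = np.array([[-2.0], [-1.0], [0.5], [2.0]])
y = np.array([-0.9, -0.8, 0.5, 0.95])
mu_hat, var_hat = gp_predict_noisy(X, y, np.array([0.0]), sigma_y=0.1)
```

Unlike the noise-free case, the predictive variance at a training input no longer shrinks to zero: with one observation $y = 1$ at $x = 0$, unit kernel amplitude and $\sigma_y = 1$, the formulas give $\hat{\mu} = 0.5$ and $\hat{\Sigma} = 0.5$ at that same input.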

Bayesian Optimization with GP Priors
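Setup for Bayesian Optimization ($\mathbf{x}$ is the vector of MLA hyperparameters):
1. $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^D$ and $\mathcal{X}$ bounded
2. $f(\mathbf{x})$ is drawn from a GP prior
3. Want to minimize $f(\mathbf{x})$ on $\mathcal{X}$
4. Observations are of the form $\{\mathbf{x}_n, y_n\}_{n=1}^N$ with $y_n \sim \mathcal{N}(f(\mathbf{x}_n), \nu)$
5. An acquisition function (AF), $a : \mathcal{X} \to \mathbb{R}^+$, is used via $\mathbf{x}_{\text{next}} = \arg\max_{\mathbf{x}} a(\mathbf{x})$

- $a(\mathbf{x}) = a(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)$ depends on previous observations and GP hyperparameters
- It depends on the model solely through
  - Predictive mean function $\mu(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)$
  - Predictive variance function $\sigma^2(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)$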
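
Step 5 of the setup can be sketched as a search over a finite candidate set. Everything in this example is a stand-in chosen for illustration: the candidate grid, the toy predictive mean and standard deviation (in practice these would come from the fitted GP), and the placeholder acquisition, which is just the standardized gap $(f_{\text{best}} - \mu)/\sigma$ and can be negative, unlike a proper AF.

```python
import numpy as np

def propose_next(candidates, predictive_mean, predictive_std, acquisition, f_best):
    """Return the candidate maximizing a(x); a sees the model only via mean and std."""
    mu = predictive_mean(candidates)
    sigma = predictive_std(candidates)
    scores = acquisition(mu, sigma, f_best)
    return candidates[np.argmax(scores)], scores

# Hypothetical stand-ins for a fitted GP: quadratic mean, constant uncertainty.
candidates = np.linspace(0.0, 1.0, 101)[:, None]
toy_mean = lambda X: (X[:, 0] - 0.3) ** 2
toy_std = lambda X: np.full(len(X), 0.1)

# Placeholder acquisition (assumption): standardized gap to the current best value.
standardized_gap = lambda mu, sigma, f_best: (f_best - mu) / sigma

x_next, scores = propose_next(candidates, toy_mean, toy_std, standardized_gap, f_best=0.05)
```

Because the uncertainty is constant here, the proposal simply lands where the toy mean is lowest. A grid search is only practical in low dimensions; a continuous optimizer over $\mathcal{X}$ would be the usual alternative.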

What is f?

- This framework is useful when evaluations of $f$ are expensive.
- This is the case when each evaluation requires training a machine learning algorithm.
- Thus, we should be smart about where we evaluate next.

Acquisition Functions
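Let $\phi$ and $\Phi$ be the pdf and cdf of a standard normal, and let $\mathbf{x}_{\text{best}} = \arg\min_{\mathbf{x}_n} f(\mathbf{x}_n)$.

1. Probability of Improvement.

$$a_{PI}(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta) = \Phi(\gamma(\mathbf{x})) = P(Z \le \gamma(\mathbf{x}))$$

$$\gamma(\mathbf{x}) = \frac{f(\mathbf{x}_{\text{best}}) - \mu(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)}{\sigma(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)}$$

and $Z \sim \mathcal{N}(0, 1)$.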

Points that have a high probability of being infinitesimally less than $f(\mathbf{x}_{\text{best}})$ will be chosen over points that offer larger gains but with less certainty.

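2. Expected Improvement (over current best) [BCd10]

$$a_{EI}(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta) = \sigma(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta) \cdot [\gamma(\mathbf{x}) \Phi(\gamma(\mathbf{x})) + \phi(\gamma(\mathbf{x}))]$$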
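
Both acquisition functions need only the predictive mean, the predictive standard deviation and the current best value, so they can be sketched with the standard library alone. The function names and the two example candidates are assumptions made for illustration.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gamma(f_best, mu, sigma):
    """Standardized improvement over the current best (minimization)."""
    return (f_best - mu) / sigma

def probability_of_improvement(f_best, mu, sigma):
    return std_normal_cdf(gamma(f_best, mu, sigma))

def expected_improvement(f_best, mu, sigma):
    g = gamma(f_best, mu, sigma)
    return sigma * (g * std_normal_cdf(g) + std_normal_pdf(g))

f_best = 1.0
safe = (0.95, 0.01)   # (mu, sigma): small but nearly certain gain
bold = (1.00, 1.00)   # (mu, sigma): no gain on average, large uncertainty

pi_safe = probability_of_improvement(f_best, *safe)
pi_bold = probability_of_improvement(f_best, *bold)
ei_safe = expected_improvement(f_best, *safe)
ei_bold = expected_improvement(f_best, *bold)
```

With these two candidates, probability of improvement ranks the safe point first while expected improvement ranks the bold point first, which is the exploitation bias of PI described above.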

Covariance Function and its Hyperparameters

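The ARD SE kernel is too smooth; instead use the ARD Matérn 5/2 kernel:

$$K_{M52}(\mathbf{x}, \mathbf{x}') = \theta_0 \left( 1 + \sqrt{5 r^2(\mathbf{x}, \mathbf{x}')} + \frac{5}{3} r^2(\mathbf{x}, \mathbf{x}') \right) \exp\left\{ -\sqrt{5 r^2(\mathbf{x}, \mathbf{x}')} \right\}$$

Its sample functions are twice differentiable.

$$r^2(\mathbf{x}, \mathbf{x}') = \sum_{d=1}^D (x_d - x'_d)^2 / \theta_d^2$$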

D + 3 hyperparameters:
- D length scales $\theta_{1:D}$
- Amplitude $\theta_0$
- Observation noise $\nu$
- Constant mean $m$
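
A minimal NumPy sketch of the ARD Matérn 5/2 kernel, following the formula for $K_{M52}$ and $r^2$ above. The function name, argument names and default values are assumptions.

```python
import numpy as np

def matern52_ard(A, B, theta0=1.0, lengthscales=None):
    """ARD Matern 5/2 kernel matrix between rows of A (n, D) and B (m, D)."""
    D = A.shape[1]
    ls = np.ones(D) if lengthscales is None else np.asarray(lengthscales, dtype=float)
    diff = (A[:, None, :] - B[None, :, :]) / ls   # (x_d - x'_d) / theta_d
    r2 = (diff ** 2).sum(axis=-1)                 # r^2(x, x')
    s = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + s + (5.0 / 3.0) * r2) * np.exp(-s)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
# Second dimension has length scale 2, so a distance of 2 there counts like 1.
K = matern52_ard(X, X, theta0=1.5, lengthscales=[1.0, 2.0])
```

The diagonal of `K` equals the amplitude `theta0`, and the two off-diagonal entries involving the first point coincide because each has scaled distance 1.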

Integrated Acquisition Function

To be fully Bayesian, we should marginalize over the GP hyperparameters (denoted by $\theta$) by computing the Integrated Acquisition Function (IAF)

$$\hat{a}(\mathbf{x}; \{\mathbf{x}_n, y_n\}) = \int a(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta) \, p(\theta \mid \{\mathbf{x}_n, y_n\}) \, d\theta$$

- This expectation accounts for the uncertainty in the chosen hyperparameters
- Can blend $a(\cdot)$ functions arising from samples of the posterior over GP hyperparameters, and then use a Monte Carlo estimate of the Integrated Expected Improvement (IEI)
- To do this MC, use slice sampling [MP10]
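
The Monte Carlo estimate amounts to averaging the acquisition function over posterior samples of $\theta$. The sketch below assumes those samples have already been drawn (for instance by slice sampling, which is not implemented here) and that each sample has been turned into a predictive mean and standard deviation at the candidate point; the numbers are made up for illustration.

```python
import math

def expected_improvement(f_best, mu, sigma):
    g = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(g / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * g * g) / math.sqrt(2.0 * math.pi)
    return sigma * (g * cdf + pdf)

def integrated_ei(f_best, predictions):
    """Average EI over hyperparameter samples.

    predictions: list of (mu, sigma) at one candidate x, one pair per sample of theta.
    """
    return sum(expected_improvement(f_best, mu, s) for mu, s in predictions) / len(predictions)

# Hypothetical predictive moments at one candidate under three samples of theta.
predictions = [(0.9, 0.2), (1.1, 0.5), (1.0, 0.3)]
iei = integrated_ei(f_best=1.0, predictions=predictions)
```

Maximizing this averaged quantity over $\mathbf{x}$ replaces the point-estimate acquisition function without changing the rest of the optimization loop.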

- We don't just care about minimizing $f$
- Evaluating $f$ can result in vastly different execution times depending on the MLA hyperparameters
- Propose optimizing expected improvement per second

- We don't know the true $f$, and we also don't know the duration function $c(\mathbf{x}) : \mathcal{X} \to \mathbb{R}^+$
- Solution: model $\ln c(\mathbf{x})$ alongside $f$; assuming the two are independent makes the computation easier
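
One way to sketch expected improvement per second is to divide EI by a predicted duration. The choice below, using the exponential of the predictive mean of $\ln c(\mathbf{x})$ as the duration estimate, is an assumption made for this example, as are the function names and numbers.

```python
import math

def expected_improvement(f_best, mu, sigma):
    g = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(g / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * g * g) / math.sqrt(2.0 * math.pi)
    return sigma * (g * cdf + pdf)

def ei_per_second(f_best, mu, sigma, log_cost_mean):
    """EI divided by the predicted duration exp(E[ln c(x)]) (assumed estimator)."""
    return expected_improvement(f_best, mu, sigma) / math.exp(log_cost_mean)

# Hypothetical candidates: a slow, promising one and a fast, modest one.
slow = ei_per_second(1.0, mu=1.0, sigma=1.0, log_cost_mean=math.log(100.0))
fast = ei_per_second(1.0, mu=1.0, sigma=0.25, log_cost_mean=math.log(10.0))
```

Here the fast candidate has a quarter of the raw EI but a tenth of the predicted run time, so it scores higher per second.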

Parallelization Scheme

Use batch parallelism plus a sequential strategy over the yet-to-be-evaluated points by computing MC estimates of the AF over different possible realizations of the pending $y$'s.

- $N$ evaluations have completed, yielding $\{\mathbf{x}_n, y_n\}_{n=1}^N$
- $J$ evaluations are pending at locations $\{\mathbf{x}_j\}_{j=1}^J$
- Choose a new point based on the expected AF under all possible outcomes of the pending evaluations

$$\hat{a}(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta, \{\mathbf{x}_j\}) = \int a(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta, \{\mathbf{x}_j, y_j\}) \, p(\{y_j\} \mid \{\mathbf{x}_j\}, \{\mathbf{x}_n, y_n\}) \, dy_1 \cdots dy_J$$
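
The integral over pending outcomes can be approximated by sampling "fantasy" outcomes from the GP predictive distribution at the pending points, conditioning on each, and averaging EI. The sketch below is one possible reading under stated assumptions: a zero-mean GP with a unit squared exponential kernel and fixed hyperparameters, a small fixed noise variance, and a current best taken over observed and fantasized values. All names and data are hypothetical.

```python
import math
import numpy as np

def se_kernel(A, B, ell=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / ell**2)

def gp_posterior(X, y, X_star, noise=1e-6):
    """Zero-mean GP posterior mean and covariance of f at X_star."""
    K_y = se_kernel(X, X) + noise * np.eye(len(X))
    K_s = se_kernel(X, X_star)
    mean = K_s.T @ np.linalg.solve(K_y, y)
    cov = se_kernel(X_star, X_star) - K_s.T @ np.linalg.solve(K_y, K_s)
    return mean, cov

def expected_improvement(f_best, mu, sigma):
    g = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(g / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * g * g) / math.sqrt(2.0 * math.pi)
    return sigma * (g * cdf + pdf)

def ei_with_pending(x, X, y, X_pending, n_fantasies=200, noise=1e-6, seed=0):
    """MC estimate of EI at x, averaged over fantasized outcomes at X_pending."""
    rng = np.random.default_rng(seed)
    mean_p, cov_p = gp_posterior(X, y, X_pending, noise)
    cov_p = cov_p + noise * np.eye(len(X_pending))   # predictive covariance of y_j
    fantasies = rng.multivariate_normal(mean_p, cov_p, size=n_fantasies)
    X_aug = np.vstack([X, X_pending])
    total = 0.0
    for y_fant in fantasies:
        y_aug = np.concatenate([y, y_fant])          # treat the fantasy as data
        mu, cov = gp_posterior(X_aug, y_aug, x[None, :], noise)
        sigma = math.sqrt(max(cov[0, 0], 1e-12))
        total += expected_improvement(y_aug.min(), mu[0], sigma)
    return total / n_fantasies

X = np.array([[0.0], [2.0]])
y = np.array([0.5, 0.3])
X_pending = np.array([[1.0]])                        # one evaluation still running
ei_at_pending = ei_with_pending(np.array([1.0]), X, y, X_pending)
ei_elsewhere = ei_with_pending(np.array([-1.5]), X, y, X_pending)
```

The averaged acquisition is small at a location that is already being evaluated, which steers the next worker toward a different point.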

Methods and Metrics

- Expected improvement with GP hyperparameter marginalization as GP EI MCMC
- Expected improvement with optimized GP hyperparameters as GP EI Opt
- EI per second as GP EI per Second
- N times parallelized GP EI MCMC as Nx GP EI MCMC

Online LDA

- Learning rate $\rho_t = (\tau_0 + t)^{-\kappa}$, with hyperparameters $(\tau_0, \kappa)$
- Minibatch size
The cited paper uses an exhaustive search of size 6 × 6 × 8 (288 configurations).
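
For concreteness, the learning-rate schedule and the size of the exhaustive grid can be sketched as below. The slides do not give the grid values or say which hyperparameter takes 8 values, so the grid is enumerated by index only; the example values of $\tau_0$ and $\kappa$ are arbitrary.

```python
import itertools

def learning_rate(t, tau0, kappa):
    """Online LDA learning rate rho_t = (tau0 + t) ** (-kappa)."""
    return (tau0 + t) ** (-kappa)

# Arbitrary example values (assumed): the rate decays as t grows.
rates = [learning_rate(t, tau0=1.0, kappa=0.5) for t in range(4)]

# A 6 x 6 x 8 grid, enumerated by index because the actual values are not listed.
grid = list(itertools.product(range(6), range(6), range(8)))
n_configurations = len(grid)
```

Each grid point corresponds to one full training run, which is the cost that Bayesian optimization tries to reduce.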

3-layer CNN

- Number of epochs to run the model
- Learning rate
- Four weight costs (one for each layer and the softmax output weights)
- Width, scale and power of the response normalization on the pooling layers

