
Practical Bayesian Optimization of Machine Learning Algorithms

Arturo Fernandez

CS 294
University of California, Berkeley

Tuesday, April 20, 2016







Motivation

Machine Learning Algorithms (MLAs) have hyperparameters that often need to be tuned:
- model hyperparameters (e.g. in Bayesian models)
- regularization parameters
- optimization procedure parameters
  - step size
  - minibatch size

Can we automate the optimization of these high-level parameters?

With some assumptions and Bayesian magic, yes!



Gaussian Process

Usually we observe inputs xi and outputs yi. For now, we assume yi = f(xi) (no noise) for some unknown function f.

Gaussian Processes (GPs) approach the prediction problem by inferring a distribution over functions given the data, p(f | X, y), and then making predictions as

    p(y∗ | x∗, X, y) = ∫ p(y∗ | f, x∗) · p(f | X, y) df

A Gaussian Process defines a prior over functions, which becomes a posterior over functions once we see data.



Gaussian Process

A Gaussian Process is defined so that for any n ∈ N,

    (f(x1), . . . , f(xn)) ∼ N(µ, K)

where µ ∈ Rⁿ and Ki,j = κ(xi, xj) for a positive definite kernel function κ.

Key Idea: If xi and xj are similar according to the kernel, then the outputs of the function at those points should be similar.



Gaussian Process

Let the prior on the regression function be a GP:

    f(x) ∼ GP( m(x), κ(x, x′) )

    m(x) = E[f(x)]
    κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))ᵀ]

are the mean and covariance/kernel functions, respectively.

For a finite set of points, this defines a joint Gaussian

    p(f | X) = N(f | µ, K)

where µ = (m(x1), . . . , m(xn)); usually m(x) = 0.
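As a concrete illustration (my own sketch, not from the original slides), the following numpy code draws sample functions from a zero-mean GP prior with a squared-exponential kernel; the evaluation grid, length scale ℓ, and signal variance σf² are arbitrary example choices.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel: sigma_f^2 * exp(-(x - x')^2 / (2 ell^2))."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sq_dists / ell ** 2)

# On a finite set of input points, the GP prior reduces to a joint Gaussian N(0, K).
x = np.linspace(-5, 5, 200)
K = se_kernel(x, x) + 1e-8 * np.eye(len(x))  # small jitter for numerical stability

# Draw a few sample functions from the prior.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)  # (3, 200): three functions evaluated on the grid
```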





GP Noise-free and Multivariate Gaussian Refresher

We see a training set D = {(xi, fi), i ∈ [N]} where fi = f(xi).

Given a test set X∗ of size N∗ × D, we want to predict the outputs f∗. By the definition of a GP,

    [f; f∗] ∼ N( [µ; µ∗], [K, K∗; K∗ᵀ, K∗∗] )

Thus

    f∗ ∼ p(f∗ | X∗, X, f) = N(µ̂, Σ̂)

    µ̂ = µ(X∗) + K∗ᵀ K⁻¹ (f − µ(X))
    Σ̂ = K∗∗ − K∗ᵀ K⁻¹ K∗




Priors and Kernels

Samples from a prior p(f | X), using the squared exponential / Gaussian / RBF kernel:

    κ(x, x′) = σf² exp( −(x − x′)² / (2ℓ²) )    (1D case)

ℓ controls the horizontal scale of variation, σf² controls the vertical variation.

Automatic Relevance Determination (ARD) squared exponential kernel:

    κ(x, x′) = θ0 exp( −½ (x − x′)ᵀ Diag(θ)⁻¹ (x − x′) ),    θ = [θ1² · · · θd²]
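A small sketch (my addition) of the ARD squared-exponential kernel above for D-dimensional inputs; θ0 and the per-dimension squared scales θd² are placeholder values.

```python
import numpy as np

def ard_se_kernel(X1, X2, theta0=1.0, theta=None):
    """ARD SE kernel: theta0 * exp(-0.5 * (x - x')^T Diag(theta)^{-1} (x - x')),
    where theta = [theta_1^2, ..., theta_D^2] holds one squared scale per dimension."""
    D = X1.shape[1]
    if theta is None:
        theta = np.ones(D)
    diff = X1[:, None, :] - X2[None, :, :]   # n1 x n2 x D
    sq = np.sum(diff ** 2 / theta, axis=-1)  # weighted squared distance
    return theta0 * np.exp(-0.5 * sq)

# Example: 2-D inputs where the second dimension is nearly irrelevant (large theta_2^2).
X = np.random.default_rng(1).uniform(0, 1, size=(5, 2))
K = ard_se_kernel(X, X, theta0=1.0, theta=np.array([0.1, 100.0]))
print(K.shape)  # (5, 5)
```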



Noisy Observations

In practice we observe y, where y = f(x) + ε and ε ∼ N(0, σy²); then

    Cov(y | X) = K + σy² I =: Ky

Assume E[f(x)] = 0 (so E[y] = 0 as well). Then, for a single test input,

    µ̂ = k∗ᵀ Ky⁻¹ y
    Σ̂ = k∗∗ − k∗ᵀ Ky⁻¹ k∗

where

    k∗ = [κ(x∗, x1), . . . , κ(x∗, xN)]  and  k∗∗ = κ(x∗, x∗)
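With noisy observations the only change to the earlier posterior sketch is replacing K by Ky = K + σy² I and conditioning on y. A minimal sketch (mine) of the single-test-point formulas; σy and the data are placeholders.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    return sigma_f ** 2 * np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ell ** 2)

def gp_predict_noisy(X, y, x_star, sigma_y=0.1):
    """mu_hat = k_*^T Ky^{-1} y,  var_hat = k_** - k_*^T Ky^{-1} k_*  (zero prior mean)."""
    K_y = se_kernel(X, X) + sigma_y ** 2 * np.eye(len(X))   # Ky = K + sigma_y^2 I
    k_star = se_kernel(X, np.array([x_star]))[:, 0]         # [kappa(x_*, x_1), ..., kappa(x_*, x_N)]
    k_ss = se_kernel(np.array([x_star]), np.array([x_star]))[0, 0]
    mu_hat = k_star @ np.linalg.solve(K_y, y)
    var_hat = k_ss - k_star @ np.linalg.solve(K_y, k_star)
    return mu_hat, var_hat

X = np.array([-2.0, 0.0, 1.5])
y = np.sin(X) + 0.1 * np.random.default_rng(2).standard_normal(3)
print(gp_predict_noisy(X, y, x_star=0.5))
```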





Bayesian Optimization with GP Priors
Setup for Bayesian Optimization (x is the vector of MLA hyperparameters):
1. x ∈ X ⊂ R^D and X bounded
2. f(x) is drawn from a GP prior
3. We want to minimize f(x) on X
4. Observations are of the form {xn, yn}, n = 1, . . . , N, with yn ∼ N(f(xn), ν)
5. An acquisition function (AF), a : X → R+, is used to pick the next point via xnext = arg maxx a(x) (a minimal loop sketch follows below)

- a(x) = a(x; {xn, yn}, θ) depends on the previous observations and the GP hyperparameters
- It depends on the model solely through
  - the predictive mean function µ(x; {xn, yn}, θ)
  - the predictive variance function σ²(x; {xn, yn}, θ)
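To make the setup concrete, here is a minimal Bayesian optimization loop of my own (not the paper's code): a GP with an SE kernel, expected improvement as the acquisition function, and maximization of the AF over a random candidate set. The toy objective, bounds, noise level, and kernel settings are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def se_kernel(X1, X2, ell=0.5, sigma_f=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, X_star, noise=1e-3):
    Ky = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(X, X_star)
    mu = Ks.T @ np.linalg.solve(Ky, y)
    var = np.diag(se_kernel(X_star, X_star)) - np.sum(Ks * np.linalg.solve(Ky, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    gamma = (f_best - mu) / sigma                    # we are minimizing f
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def objective(x):                                    # toy f; in practice: train the MLA, return validation error
    return np.sin(3 * x[0]) + 0.5 * x[0] ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))                  # initial design on the bounded set X
y = np.array([objective(x) for x in X])
for _ in range(10):
    cand = rng.uniform(-2, 2, size=(500, 1))         # candidate points in X
    mu, sigma = gp_posterior(X, y, cand)
    ei = expected_improvement(mu, sigma, y.min())
    x_next = cand[np.argmax(ei)]                     # x_next = argmax_x a(x)
    X = np.vstack([X, x_next]); y = np.append(y, objective(x_next))
print("best x:", X[np.argmin(y)], "best f:", y.min())
```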



What is f ?

- The framework is useful when evaluations of f are expensive.
- This is the case when each evaluation requires training a machine learning algorithm.
- Thus, we should be smart about where we evaluate next.







Acquisition Functions
Let φ, Φ be the pdf and cdf of a standard normal, and let xbest = arg min_{xn} f(xn).

1. Probability of Improvement.

    aPI(x; {xn, yn}, θ) = Φ(γ(x)) = P(N ≤ γ(x))

    γ(x) = [f(xbest) − µ(x; {xn, yn}, θ)] / σ(x; {xn, yn}, θ)

where N ∼ N(0, 1).

Points that have a high probability of being infinitesimally less than f(xbest) will be chosen over points that offer larger gains but less certainty.

2. Expected Improvement (over the current best) [BCd10]

    aEI(x; {xn, yn}, θ) = σ(x; {xn, yn}, θ) · [γ(x)Φ(γ(x)) + φ(γ(x))]
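A short sketch (mine, not the authors' code) of both acquisition functions, given the GP predictive mean µ and standard deviation σ at candidate points and the current best observed value; the numbers in the example are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def gamma(mu, sigma, f_best):
    # gamma(x) = (f(x_best) - mu(x)) / sigma(x)
    return (f_best - mu) / sigma

def prob_of_improvement(mu, sigma, f_best):
    # a_PI(x) = Phi(gamma(x))
    return norm.cdf(gamma(mu, sigma, f_best))

def expected_improvement(mu, sigma, f_best):
    # a_EI(x) = sigma(x) * (gamma(x) * Phi(gamma(x)) + phi(gamma(x)))
    g = gamma(mu, sigma, f_best)
    return sigma * (g * norm.cdf(g) + norm.pdf(g))

# Example: identical predictive means, different uncertainties.
mu = np.array([0.5, 0.5])
sigma = np.array([0.05, 0.5])
print(prob_of_improvement(mu, sigma, f_best=0.52))   # PI favors the low-variance point
print(expected_improvement(mu, sigma, f_best=0.52))  # EI favors the high-variance point
```

This small example reproduces the criticism above: PI prefers the near-certain, tiny improvement, while EI trades it off against the larger potential gain.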

Covariance Function and its Hyperparameters

The ARD SE kernel is too smooth; instead use the ARD Matérn 5/2 kernel:

    K_M52(x, x′) = θ0 ( 1 + √(5 r²(x, x′)) + (5/3) r²(x, x′) ) exp( −√(5 r²(x, x′)) )

    r²(x, x′) = Σ_{d=1}^{D} (x_d − x′_d)² / θ_d²

This kernel yields sample functions that are twice differentiable.

D + 3 hyperparameters:
- D length scales θ_{1:D}
- amplitude θ0
- observation noise ν
- constant mean m
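A sketch (my own, with placeholder hyperparameters) of the ARD Matérn 5/2 kernel as written above:

```python
import numpy as np

def ard_matern52(X1, X2, theta0=1.0, theta=None):
    """K_M52(x, x') = theta0 * (1 + sqrt(5 r^2) + (5/3) r^2) * exp(-sqrt(5 r^2)),
    with r^2(x, x') = sum_d (x_d - x'_d)^2 / theta_d^2."""
    D = X1.shape[1]
    if theta is None:
        theta = np.ones(D)                           # per-dimension length scales theta_{1:D}
    diff = X1[:, None, :] - X2[None, :, :]
    r2 = np.sum(diff ** 2 / theta ** 2, axis=-1)
    sqrt5r = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + sqrt5r + (5.0 / 3.0) * r2) * np.exp(-sqrt5r)

X = np.random.default_rng(3).uniform(size=(4, 3))    # D = 3 hyperparameter dimensions
print(ard_matern52(X, X).shape)                      # (4, 4)
```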



Integrated Acquisition Function

To be fully Bayesian, we should marginalize over the hyperparameters (denoted θ) by computing the Integrated Acquisition Function (IAF):

    â(x; {xn, yn}) = ∫ a(x; {xn, yn}, θ) · p(θ | {xn, yn}) dθ

- This expectation is a good generalization that accounts for the uncertainty in the chosen hyperparameters
- We can blend the a(·) functions arising from samples of the posterior over GP hyperparameters, and then use a Monte Carlo estimate of the Integrated Expected Improvement (IEI)
- To draw these samples, use slice sampling [MP10]
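A hedged sketch of the Monte Carlo estimate of the IAF: average the acquisition surfaces computed under different GP hyperparameter samples. Here the samples θ^(s) are assumed to come from a slice sampler over the hyperparameter posterior (not shown), and the acquisition callable is a stand-in for the EI code above.

```python
import numpy as np

def integrated_acquisition(candidates, theta_samples, acquisition_under_theta):
    """MC estimate: a_hat(x) ~= (1/S) * sum_s a(x; {x_n, y_n}, theta^(s)).

    acquisition_under_theta(candidates, theta) must return the acquisition values
    (e.g. EI) at `candidates` for a GP fit with hyperparameters `theta`."""
    values = [acquisition_under_theta(candidates, theta) for theta in theta_samples]
    return np.mean(values, axis=0)

# Usage sketch with placeholders (a real version wraps the GP + EI code above):
cand = np.linspace(-2, 2, 100)[:, None]
theta_samples = [{"ell": 0.3}, {"ell": 0.6}, {"ell": 1.0}]   # e.g. slice-sampling draws
fake_ei = lambda X, th: np.exp(-X[:, 0] ** 2 / th["ell"])    # stand-in for EI under theta
iei = integrated_acquisition(cand, theta_samples, fake_ei)
x_next = cand[np.argmax(iei)]
```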



Costs

- We don't just care about minimizing f
- Evaluating f can result in vastly different execution times depending on the MLA hyperparameters
- The authors propose optimizing expected improvement per second
- We don't know the true f, and we also don't know the duration function c(x) : X → R+
- Solution: model ln c(x) alongside f; assuming independence between them makes the computation easier
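A minimal sketch of the "EI per second" idea: fit a second, independent GP to the log durations ln c(x) and divide the expected improvement by the predicted cost. The EI values and predicted log costs below are placeholder inputs; in practice they come from the two GPs.

```python
import numpy as np

def ei_per_second(ei_values, log_cost_mu):
    """Expected improvement per second: divide EI by the predicted duration.

    ei_values   : EI at candidate points from the objective GP.
    log_cost_mu : posterior mean of a second, independent GP fit to ln c(x),
                  so exp(log_cost_mu) is the predicted duration in seconds.
    """
    predicted_seconds = np.exp(log_cost_mu)
    return ei_values / predicted_seconds

# Usage: a cheap point with modest EI can beat an expensive point with higher EI.
ei = np.array([0.10, 0.15])
log_cost = np.log(np.array([5.0, 60.0]))   # predicted 5 s vs 60 s per evaluation
print(ei_per_second(ei, log_cost))         # [0.02, 0.0025] -> pick the cheap point
```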



Parallelization Scheme

Use batch parallelism plus a sequential strategy over yet-to-be-evaluated points by computing MC estimates of the AF over different possible realizations of the pending y's.

- N evaluations have completed: {xn, yn}, n = 1, . . . , N
- J evaluations are pending at locations {x̄j}, j = 1, . . . , J
- Choose the new point based on the expected AF under all possible outcomes of the pending evaluations:

    â(x; {xn, yn}, θ, {x̄j}) = ∫_{R^J} a(x; {xn, yn}, θ, {x̄j, yj}) · p({yj} | {x̄j}, {xn, yn}) dy1 · · · dyJ
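A sketch (mine) of the Monte Carlo estimate over pending evaluations: fantasize outcomes yj for the pending points x̄j, compute the acquisition function as if those outcomes were observed, and average. The `sample_pending` and `acquisition` callables are assumptions standing in for the GP posterior sampler and EI sketches above; only the fantasizing logic is shown.

```python
import numpy as np

def parallel_acquisition(candidates, X_obs, y_obs, X_pending,
                         sample_pending, acquisition, n_fantasy=16, seed=0):
    """MC estimate of the acquisition under possible outcomes of pending jobs:

        a_hat(x) ~= (1/S) * sum_s a(x; {x_n, y_n} U {x_bar_j, y_j^(s)}, theta)

    sample_pending(X_pending, X_obs, y_obs, rng) draws one fantasy vector of
    outcomes for the pending points (ideally from the current GP posterior);
    acquisition(candidates, X, y) returns acquisition values given a data set."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(len(candidates))
    for _ in range(n_fantasy):
        y_fantasy = sample_pending(X_pending, X_obs, y_obs, rng)
        X_aug = np.vstack([X_obs, X_pending])
        y_aug = np.concatenate([y_obs, y_fantasy])
        acc += acquisition(candidates, X_aug, y_aug)
    return acc / n_fantasy

# Usage with toy stand-ins (a real version samples fantasies from the GP posterior and uses EI):
X_obs, y_obs = np.array([[0.0], [1.0]]), np.array([1.0, 0.5])
X_pending = np.array([[0.5]])
cand = np.linspace(-1, 2, 50)[:, None]
sample_pending = lambda Xp, X, y, rng: y.mean() + 0.1 * rng.standard_normal(len(Xp))
acquisition = lambda C, X, y: -np.abs(C[:, 0] - X[np.argmin(y), 0])   # toy stand-in for EI
print(parallel_acquisition(cand, X_obs, y_obs, X_pending, sample_pending, acquisition).shape)
```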



Methods and Metrics

- Expected improvement with GP hyperparameter marginalization, denoted GP EI MCMC
- Optimizing the GP hyperparameters, denoted GP EI Opt
- EI per second, denoted GP EI per Second
- N-times parallelized GP EI MCMC, denoted Nx GP EI MCMC



Online LDA

Hyperparameters:
- Learning rate ρt = (τ0 + t)^(−κ), i.e. the pair (τ0, κ)
- Minibatch size
The cited paper uses an exhaustive grid search of size 6 × 6 × 8 (288 configurations).



Results



3-layer CNN

Hyperparameters:
- Number of epochs to run the model
- Learning rate
- Four weight costs (one for each layer and one for the softmax output weights)
- Width, scale, and power of the response normalization on the pooling layers
The cited paper uses an exhaustive search of size 6 × 6 × 8 (288 configurations).



Results



References

[BCd10] E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. ArXiv e-prints, December 2010.

[MP10] I. Murray and R. P. Adams. Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. ArXiv e-prints, June 2010.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. ArXiv e-prints, June 2012.

