
Introduction to Gaussian Process Regression


Hanna M. Wallach
hmw26@cam.ac.uk

January 25, 2005


Outline

Regression: weight-space view
Regression: function-space view (Gaussian processes)
Weight-space and function-space correspondence
Making predictions
Model selection: hyperparameters


Supervised Learning: Regression (1)

[Figure: underlying function and noisy training data; axes: input x vs. output f(x)]

Assume an underlying process which generates clean data.


Goal: recover underlying process from noisy observed data.


Supervised Learning: Regression (2)

Training data are $\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \mid i = 1, \ldots, n \}$.

Each input is a vector $x$ of dimension $d$.
Each target is a real-valued scalar $y = f(x) + \text{noise}$.
Collect the inputs in a $d \times n$ matrix $X$ and the targets in a vector $y$, so that $\mathcal{D} = \{X, y\}$.
We wish to infer $f^*$ for an unseen input $x^*$, using $P(f^* \mid x^*, \mathcal{D})$.
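As a concrete illustration of this setup (not part of the original slides), the following sketch generates noisy observations from an assumed underlying function $f(x) = \sin(2\pi x)$; the function, the noise level, and the variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative choices (assumptions, not from the slides): the underlying
# function, the noise standard deviation, and the number of points.
rng = np.random.default_rng(0)
n = 20
X = rng.uniform(-1.0, 1.0, size=(n, 1))   # inputs, one per row (the slides store them as a d x n matrix)
f_clean = np.sin(2 * np.pi * X[:, 0])     # clean function values f(x)
y = f_clean + 0.1 * rng.normal(size=n)    # noisy targets y = f(x) + noise
```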


Gaussian Process Models: Inference in Function Space

[Figure: samples from the prior (left panel) and samples from the posterior (right panel); axes: input x vs. output f(x)]

A Gaussian process defines a distribution over functions.


Inference takes place directly in function space.


Part I
Regression: The Weight-Space View


Bayesian Linear Regression (1)

[Figure: noisy training data; axes: input x vs. output f(x)]

Assuming noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the linear regression model is:
\[
f(x \mid w) = x^\top w, \qquad y = f + \epsilon.
\]


Bayesian Linear Regression (2)

The likelihood of the parameters is:
\[
P(y \mid X, w) = \mathcal{N}(X^\top w, \sigma^2 I).
\]
Assume a Gaussian prior over the parameters:
\[
P(w) = \mathcal{N}(0, \Sigma_p).
\]
Apply Bayes' theorem to obtain the posterior:
\[
P(w \mid y, X) \propto P(y \mid X, w)\, P(w).
\]


Bayesian Linear Regression (3)

The posterior distribution over $w$ is:
\[
P(w \mid y, X) = \mathcal{N}\!\left( \tfrac{1}{\sigma^2} A^{-1} X y,\; A^{-1} \right),
\quad \text{where } A = \Sigma_p^{-1} + \tfrac{1}{\sigma^2} X X^\top.
\]
The predictive distribution is:
\[
P(f^* \mid x^*, X, y) = \int f(x^* \mid w)\, P(w \mid X, y)\, dw
= \mathcal{N}\!\left( \tfrac{1}{\sigma^2} x^{*\top} A^{-1} X y,\; x^{*\top} A^{-1} x^* \right).
\]
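A minimal numerical sketch of these two formulas, assuming 1-D arrays, a chosen prior covariance $\Sigma_p$, and noise level $\sigma^2$; the function and variable names are illustrative:

```python
import numpy as np

def blr_posterior_predictive(X, y, x_star, sigma2=0.01, Sigma_p=None):
    """Bayesian linear regression: posterior over w and predictive for f* at x_star.

    X      : (d, n) matrix of training inputs (one column per input, as in the slides)
    y      : (n,) vector of targets
    x_star : (d,) test input
    """
    d = X.shape[0]
    if Sigma_p is None:
        Sigma_p = np.eye(d)                      # prior covariance over weights (assumption)
    A = np.linalg.inv(Sigma_p) + (X @ X.T) / sigma2
    A_inv = np.linalg.inv(A)
    w_mean = A_inv @ X @ y / sigma2              # posterior mean of w
    f_mean = x_star @ w_mean                     # predictive mean (1/sigma^2) x*^T A^{-1} X y
    f_var = x_star @ A_inv @ x_star              # predictive variance x*^T A^{-1} x*
    return w_mean, A_inv, f_mean, f_var
```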


Increasing Expressiveness

Use a set of basis functions $\phi(x)$ to project a $d$-dimensional input $x$ into an $m$-dimensional feature space, e.g. $\phi(x) = (1, x, x^2, \ldots)$.

$P(f^* \mid x^*, X, y)$ can be expressed in terms of inner products in feature space: we can now use the kernel trick.

How many basis functions should we use?


Part II
Regression: The Function-Space View


Gaussian Processes: Definition

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Consistency: if the GP specifies $y^{(1)}, y^{(2)} \sim \mathcal{N}(\mu, \Sigma)$, then it must also specify $y^{(1)} \sim \mathcal{N}(\mu_1, \Sigma_{11})$.

A GP is completely specified by a mean function and a positive definite covariance function.


Gaussian Processes: A Distribution over Functions

e.g. choose the mean function to be zero, and the covariance function:
\[
K_{p,q} = \mathrm{Cov}\!\left(f(x^{(p)}), f(x^{(q)})\right) = K(x^{(p)}, x^{(q)}).
\]
For any set of inputs $x^{(1)}, \ldots, x^{(n)}$ we may compute $K$, which defines a joint distribution over function values:
\[
f(x^{(1)}), \ldots, f(x^{(n)}) \sim \mathcal{N}(0, K).
\]
Therefore a GP specifies a distribution over functions.


Gaussian Processes: Simple Example

We can obtain a GP from the Bayesian linear regression model:
\[
f(x) = x^\top w \quad \text{with} \quad w \sim \mathcal{N}(0, \Sigma_p).
\]
The mean function is given by:
\[
\mathbb{E}[f(x)] = x^\top \mathbb{E}[w] = 0.
\]
The covariance function is given by:
\[
\mathbb{E}[f(x) f(x')] = x^\top \mathbb{E}[w w^\top]\, x' = x^\top \Sigma_p\, x'.
\]


Weight-Space and Function-Space Correspondence

For any set of $m$ basis functions $\phi(x)$, the corresponding covariance function is:
\[
K(x^{(p)}, x^{(q)}) = \phi(x^{(p)})^\top \Sigma_p\, \phi(x^{(q)}).
\]
Conversely, for every covariance function $k$, there is a (possibly infinite) expansion in terms of basis functions:
\[
K(x^{(p)}, x^{(q)}) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x^{(p)})\, \phi_i(x^{(q)}).
\]
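To make the first direction concrete, the sketch below builds the covariance matrix implied by a small polynomial basis; the basis, $\Sigma_p$, and the names are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def poly_features(x, m=4):
    """phi(x) = (1, x, x^2, ..., x^{m-1}) for scalar inputs x (illustrative basis)."""
    return np.vander(np.atleast_1d(x), N=m, increasing=True)   # shape (n, m)

def covariance_from_basis(x, Sigma_p=None, m=4):
    """K[p, q] = phi(x^(p))^T Sigma_p phi(x^(q)) for all pairs of inputs."""
    Phi = poly_features(x, m)                    # (n, m)
    if Sigma_p is None:
        Sigma_p = np.eye(m)                      # prior covariance over weights (assumption)
    return Phi @ Sigma_p @ Phi.T                 # (n, n) covariance matrix
```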


The Covariance Function


Specifies the covariance between pairs of random variables.
e.g. the squared exponential covariance function:
\[
\mathrm{Cov}\!\left(f(x^{(p)}), f(x^{(q)})\right) = K(x^{(p)}, x^{(q)})
= \exp\!\left( -\tfrac{1}{2} \left| x^{(p)} - x^{(q)} \right|^2 \right).
\]
[Figure: K(x^{(p)} = 5, x^{(q)}) plotted as a function of x^{(q)} over [0, 10]]
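A direct transcription of this covariance function into code, for 1-D inputs; the function name is an illustrative choice:

```python
import numpy as np

def squared_exponential(xp, xq):
    """K(x^(p), x^(q)) = exp(-0.5 * |x^(p) - x^(q)|^2) for arrays of scalar inputs."""
    diff = np.subtract.outer(xp, xq)   # pairwise differences
    return np.exp(-0.5 * diff ** 2)    # covariance matrix, entries in (0, 1]
```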


Gaussian Process Prior


Given a set of inputs $x^{(1)}, \ldots, x^{(n)}$ we may draw samples $f(x^{(1)}), \ldots, f(x^{(n)})$ from the GP prior:
\[
f(x^{(1)}), \ldots, f(x^{(n)}) \sim \mathcal{N}(0, K).
\]
Four samples:
[Figure: four samples from the GP prior; axes: input x in [0, 1] vs. output f(x)]
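The figure above can be reproduced roughly with the following sketch: build $K$ on a grid of inputs and draw joint Gaussian samples. The grid and the small jitter term added for numerical stability are assumptions.

```python
import numpy as np

def se_kernel(xp, xq):
    return np.exp(-0.5 * np.subtract.outer(xp, xq) ** 2)

x = np.linspace(0.0, 1.0, 100)                   # grid of inputs (assumption)
K = se_kernel(x, x) + 1e-10 * np.eye(len(x))     # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=4)   # four prior samples
```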


Posterior: Noise-Free Observations (1)

Given noise-free training data:
\[
\mathcal{D} = \{ (x^{(i)}, f^{(i)}) \mid i = 1, \ldots, n \} = \{X, f\}.
\]
We want to make predictions $f^*$ at test points $X^*$.
According to the GP prior, the joint distribution of $f$ and $f^*$ is:
\[
\begin{bmatrix} f \\ f^* \end{bmatrix}
\sim \mathcal{N}\!\left( 0,
\begin{bmatrix}
K(X, X) & K(X, X^*) \\
K(X^*, X) & K(X^*, X^*)
\end{bmatrix}
\right).
\]


Posterior: Noise-Free Observations (2)

Condition $\{X^*, f^*\}$ on $\mathcal{D} = \{X, f\}$ to obtain the posterior.

This restricts the prior to contain only functions which agree with $\mathcal{D}$.
The posterior, $P(f^* \mid X^*, X, f)$, is Gaussian with:
\[
\mu = K(X^*, X)\, K(X, X)^{-1} f, \quad \text{and}
\]
\[
\Sigma = K(X^*, X^*) - K(X^*, X)\, K(X, X)^{-1} K(X, X^*).
\]
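A minimal sketch of this conditioning step, reusing the squared exponential covariance; the jitter term and the names are illustrative assumptions:

```python
import numpy as np

def se_kernel(xp, xq):
    return np.exp(-0.5 * np.subtract.outer(xp, xq) ** 2)

def gp_posterior_noise_free(X, f, X_star):
    """Posterior mean and covariance of f* given noise-free observations (X, f)."""
    K_xx = se_kernel(X, X)
    K_xs = se_kernel(X, X_star)                            # K(X, X*)
    K_ss = se_kernel(X_star, X_star)
    K_inv = np.linalg.inv(K_xx + 1e-10 * np.eye(len(X)))   # jitter (assumption)
    mu = K_xs.T @ K_inv @ f                                # K(X*, X) K(X, X)^{-1} f
    Sigma = K_ss - K_xs.T @ K_inv @ K_xs                   # K(X*,X*) - K(X*,X) K(X,X)^{-1} K(X,X*)
    return mu, Sigma
```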


Posterior: Noise-Free Observations (3)

[Figure: samples from the posterior; axes: input x vs. output f(x)]

Samples all agree with the observations $\mathcal{D} = \{X, f\}$.


Greatest variance is in regions with few training points.


Prediction: Noisy Observations

Typically we have noisy observations:
\[
\mathcal{D} = \{X, y\}, \quad \text{where } y = f + \epsilon.
\]
Assume additive noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Conditioning on $\mathcal{D} = \{X, y\}$ gives a Gaussian with:
\[
\mu = K(X^*, X)\, [K(X, X) + \sigma^2 I]^{-1} y, \quad \text{and}
\]
\[
\Sigma = K(X^*, X^*) - K(X^*, X)\, [K(X, X) + \sigma^2 I]^{-1} K(X, X^*).
\]
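The same sketch with the noise term added to the training covariance; the noise level $\sigma^2$ and the names are assumptions:

```python
import numpy as np

def se_kernel(xp, xq):
    return np.exp(-0.5 * np.subtract.outer(xp, xq) ** 2)

def gp_predict(X, y, X_star, sigma2=0.01):
    """Predictive mean and covariance of f* given noisy observations (X, y)."""
    K_y = se_kernel(X, X) + sigma2 * np.eye(len(X))   # K(X, X) + sigma^2 I
    K_xs = se_kernel(X, X_star)
    K_ss = se_kernel(X_star, X_star)
    K_y_inv = np.linalg.inv(K_y)
    mu = K_xs.T @ K_y_inv @ y
    Sigma = K_ss - K_xs.T @ K_y_inv @ K_xs
    return mu, Sigma
```

In practice one would solve with a Cholesky factorization rather than forming the explicit inverse, but the inverse keeps the sketch close to the formulas above.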


Model Selection: Hyperparameters


e.g. the ARD covariance function with length-scale hyperparameter $\ell$:
\[
k(x^{(p)}, x^{(q)}) = \exp\!\left( -\frac{1}{2\ell^2} \left( x^{(p)} - x^{(q)} \right)^2 \right).
\]
How best to choose $\ell$?


[Figure: samples from the posterior for length-scales 0.1, 0.3, and 0.5; axes: input x vs. output f(x)]
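To reproduce the qualitative effect in the figure above, the following sketch draws posterior samples for several length-scales; the training data, noise level, and names are illustrative assumptions rather than the values used in the original figure.

```python
import numpy as np

def se_kernel(xp, xq, lengthscale):
    return np.exp(-0.5 * (np.subtract.outer(xp, xq) / lengthscale) ** 2)

rng = np.random.default_rng(0)
X = np.array([0.1, 0.3, 0.5, 0.7, 0.9])                 # training inputs (assumption)
y = np.sin(2 * np.pi * X) + 0.05 * rng.normal(size=X.size)
X_star = np.linspace(0.0, 1.0, 200)
sigma2 = 0.01                                           # noise variance (assumption)

for ell in (0.1, 0.3, 0.5):
    K_y = se_kernel(X, X, ell) + sigma2 * np.eye(X.size)
    K_xs = se_kernel(X, X_star, ell)
    K_ss = se_kernel(X_star, X_star, ell)
    K_y_inv = np.linalg.inv(K_y)
    mu = K_xs.T @ K_y_inv @ y
    Sigma = K_ss - K_xs.T @ K_y_inv @ K_xs
    # small lengthscale -> wiggly posterior samples; large lengthscale -> smooth samples
    samples = rng.multivariate_normal(mu, Sigma + 1e-10 * np.eye(X_star.size), size=3)
```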


Model Selection: Optimizing Marginal Likelihood (1)

In the absence of a strong prior $P(\ell)$, the posterior for the hyperparameter $\ell$ is proportional to the marginal likelihood:
\[
P(\ell \mid X, y) \propto P(y \mid X, \ell).
\]
Choose $\ell$ to optimize the marginal log-likelihood:
\[
\log P(y \mid X, \ell) = -\tfrac{1}{2} \log \left| K(X, X) + \sigma^2 I \right|
- \tfrac{1}{2}\, y^\top \left( K(X, X) + \sigma^2 I \right)^{-1} y
- \tfrac{n}{2} \log 2\pi.
\]
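A direct transcription of this expression, using a Cholesky factorization for numerical stability; the kernel choice and the names are illustrative:

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, sigma2):
    """log P(y | X, lengthscale) for an SE kernel with the given length-scale and noise."""
    n = y.size
    K_y = np.exp(-0.5 * (np.subtract.outer(X, X) / lengthscale) ** 2) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K_y)                          # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K_y)^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |K_y|
    return -0.5 * log_det - 0.5 * y @ alpha - 0.5 * n * np.log(2 * np.pi)
```

The length-scale (and, if desired, the noise variance) can then be chosen by maximizing this quantity over a grid or with a gradient-based optimizer.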


Model Selection: Optimizing Marginal Likelihood (2)

$\ell_{\mathrm{ML}} = 0.3255$:
[Figure: samples from the posterior with the maximum-likelihood length-scale 0.3255; axes: input x vs. output f(x)]

Using $\ell_{\mathrm{ML}}$ is an approximation to the true Bayesian method of integrating over all values of $\ell$ weighted by their posterior.


References

1. Carl Edward Rasmussen. Gaussian Processes in Machine Learning. Machine Learning Summer School, Tübingen, 2003. http://www.kyb.tuebingen.mpg.de/~carl/mlss03/
2. Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. Forthcoming.
3. Carl Edward Rasmussen. The Gaussian Process Website. http://www.gatsby.ucl.ac.uk/~edward/gp/
