
Introduction to Gaussian Process Regression


Hanna M. Wallach
hmw26@cam.ac.uk

January 25, 2005


Outline

Regression: weight-space view
Regression: function-space view (Gaussian processes)
Weight-space and function-space correspondence
Making predictions
Model selection: hyperparameters


Supervised Learning: Regression (1)

[Figure: underlying function and noisy training data; axes: input x vs. output f(x)]

Assume an underlying process which generates clean data.


Goal: recover underlying process from noisy observed data.


Supervised Learning: Regression (2)

Training data are $\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \mid i = 1, \ldots, n \}$.

Each input is a vector $x$ of dimension $d$.
Each target is a real-valued scalar $y = f(x) + \text{noise}$.
Collect the inputs in a $d \times n$ matrix $X$ and the targets in a vector $y$, so that $\mathcal{D} = \{X, y\}$.
We wish to infer $f^*$ for an unseen input $x^*$, using $P(f^* \mid x^*, \mathcal{D})$.
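As a concrete illustration of this setup (not part of the original slides), the following sketch generates noisy observations from an assumed underlying function $f(x) = \sin(2\pi x)$; the function, the noise level, and the variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative choices (assumptions, not from the slides): the underlying
# function, the noise standard deviation, and the number of points.
rng = np.random.default_rng(0)
n = 20
X = rng.uniform(-1.0, 1.0, size=(n, 1))   # inputs, one per row (the slides store them as a d x n matrix)
f_clean = np.sin(2 * np.pi * X[:, 0])     # clean function values f(x)
y = f_clean + 0.1 * rng.normal(size=n)    # noisy targets y = f(x) + noise
```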


Gaussian Process Models: Inference in Function Space

[Figure: samples from the prior (left panel) and samples from the posterior (right panel); axes: input x vs. output f(x)]

A Gaussian process defines a distribution over functions.


Inference takes place directly in function space.


Part I
Regression: The Weight-Space View


Bayesian Linear Regression (1)

[Figure: noisy training data; axes: input x vs. output f(x)]

Assuming noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the linear regression model is:
\[
f(x \mid w) = x^\top w, \qquad y = f + \epsilon.
\]


Bayesian Linear Regression (2)

The likelihood of the parameters is:
\[
P(y \mid X, w) = \mathcal{N}(X^\top w, \sigma^2 I).
\]
Assume a Gaussian prior over the parameters:
\[
P(w) = \mathcal{N}(0, \Sigma_p).
\]
Apply Bayes' theorem to obtain the posterior:
\[
P(w \mid y, X) \propto P(y \mid X, w)\, P(w).
\]


Bayesian Linear Regression (3)

The posterior distribution over $w$ is:
\[
P(w \mid y, X) = \mathcal{N}\!\left( \tfrac{1}{\sigma^2} A^{-1} X y,\; A^{-1} \right),
\quad \text{where } A = \Sigma_p^{-1} + \tfrac{1}{\sigma^2} X X^\top.
\]
The predictive distribution is:
\[
P(f^* \mid x^*, X, y) = \int f(x^* \mid w)\, P(w \mid X, y)\, dw
= \mathcal{N}\!\left( \tfrac{1}{\sigma^2} x^{*\top} A^{-1} X y,\; x^{*\top} A^{-1} x^* \right).
\]
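A minimal numerical sketch of these two formulas, assuming 1-D arrays, a chosen prior covariance $\Sigma_p$, and noise level $\sigma^2$; the function and variable names are illustrative:

```python
import numpy as np

def blr_posterior_predictive(X, y, x_star, sigma2=0.01, Sigma_p=None):
    """Bayesian linear regression: posterior over w and predictive for f* at x_star.

    X      : (d, n) matrix of training inputs (one column per input, as in the slides)
    y      : (n,) vector of targets
    x_star : (d,) test input
    """
    d = X.shape[0]
    if Sigma_p is None:
        Sigma_p = np.eye(d)                      # prior covariance over weights (assumption)
    A = np.linalg.inv(Sigma_p) + (X @ X.T) / sigma2
    A_inv = np.linalg.inv(A)
    w_mean = A_inv @ X @ y / sigma2              # posterior mean of w
    f_mean = x_star @ w_mean                     # predictive mean (1/sigma^2) x*^T A^{-1} X y
    f_var = x_star @ A_inv @ x_star              # predictive variance x*^T A^{-1} x*
    return w_mean, A_inv, f_mean, f_var
```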


Increasing Expressiveness

Use a set of basis functions $\phi(x)$ to project a $d$-dimensional input $x$ into an $m$-dimensional feature space, e.g. $\phi(x) = (1, x, x^2, \ldots)$.

$P(f^* \mid x^*, X, y)$ can be expressed in terms of inner products in feature space: we can now use the kernel trick.

How many basis functions should we use?


Part II
Regression: The Function-Space View


Gaussian Processes: Definition

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Consistency: if the GP specifies $y^{(1)}, y^{(2)} \sim \mathcal{N}(\mu, \Sigma)$, then it must also specify $y^{(1)} \sim \mathcal{N}(\mu_1, \Sigma_{11})$.

A GP is completely specified by a mean function and a positive definite covariance function.


Gaussian Processes: A Distribution over Functions

e.g. choose the mean function to be zero, and the covariance function:
\[
K_{p,q} = \mathrm{Cov}\!\left(f(x^{(p)}), f(x^{(q)})\right) = K(x^{(p)}, x^{(q)}).
\]
For any set of inputs $x^{(1)}, \ldots, x^{(n)}$ we may compute $K$, which defines a joint distribution over function values:
\[
f(x^{(1)}), \ldots, f(x^{(n)}) \sim \mathcal{N}(0, K).
\]
Therefore a GP specifies a distribution over functions.


Gaussian Processes: Simple Example

We can obtain a GP from the Bayesian linear regression model:
\[
f(x) = x^\top w \quad \text{with} \quad w \sim \mathcal{N}(0, \Sigma_p).
\]
The mean function is given by:
\[
\mathbb{E}[f(x)] = x^\top \mathbb{E}[w] = 0.
\]
The covariance function is given by:
\[
\mathbb{E}[f(x) f(x')] = x^\top \mathbb{E}[w w^\top]\, x' = x^\top \Sigma_p\, x'.
\]


Weight-Space and Function-Space Correspondence

For any set of $m$ basis functions $\phi(x)$, the corresponding covariance function is:
\[
K(x^{(p)}, x^{(q)}) = \phi(x^{(p)})^\top \Sigma_p\, \phi(x^{(q)}).
\]
Conversely, for every covariance function $k$, there is a (possibly infinite) expansion in terms of basis functions:
\[
K(x^{(p)}, x^{(q)}) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x^{(p)})\, \phi_i(x^{(q)}).
\]
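To make the first direction concrete, the sketch below builds the covariance matrix implied by a small polynomial basis; the basis, $\Sigma_p$, and the names are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def poly_features(x, m=4):
    """phi(x) = (1, x, x^2, ..., x^{m-1}) for scalar inputs x (illustrative basis)."""
    return np.vander(np.atleast_1d(x), N=m, increasing=True)   # shape (n, m)

def covariance_from_basis(x, Sigma_p=None, m=4):
    """K[p, q] = phi(x^(p))^T Sigma_p phi(x^(q)) for all pairs of inputs."""
    Phi = poly_features(x, m)                    # (n, m)
    if Sigma_p is None:
        Sigma_p = np.eye(m)                      # prior covariance over weights (assumption)
    return Phi @ Sigma_p @ Phi.T                 # (n, n) covariance matrix
```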


The Covariance Function


Specifies the covariance between pairs of random variables.
e.g. the squared exponential covariance function:
\[
\mathrm{Cov}\!\left(f(x^{(p)}), f(x^{(q)})\right) = K(x^{(p)}, x^{(q)})
= \exp\!\left( -\tfrac{1}{2} \left| x^{(p)} - x^{(q)} \right|^2 \right).
\]
[Figure: K(x^{(p)} = 5, x^{(q)}) plotted as a function of x^{(q)} over [0, 10]]
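A direct transcription of this covariance function into code, for 1-D inputs; the function name is an illustrative choice:

```python
import numpy as np

def squared_exponential(xp, xq):
    """K(x^(p), x^(q)) = exp(-0.5 * |x^(p) - x^(q)|^2) for arrays of scalar inputs."""
    diff = np.subtract.outer(xp, xq)   # pairwise differences
    return np.exp(-0.5 * diff ** 2)    # covariance matrix, entries in (0, 1]
```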


Gaussian Process Prior


Given a set of inputs $x^{(1)}, \ldots, x^{(n)}$ we may draw samples $f(x^{(1)}), \ldots, f(x^{(n)})$ from the GP prior:
\[
f(x^{(1)}), \ldots, f(x^{(n)}) \sim \mathcal{N}(0, K).
\]
Four samples:
[Figure: four samples from the GP prior; axes: input x in [0, 1] vs. output f(x)]
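The figure above can be reproduced roughly with the following sketch: build $K$ on a grid of inputs and draw joint Gaussian samples. The grid and the small jitter term added for numerical stability are assumptions.

```python
import numpy as np

def se_kernel(xp, xq):
    return np.exp(-0.5 * np.subtract.outer(xp, xq) ** 2)

x = np.linspace(0.0, 1.0, 100)                   # grid of inputs (assumption)
K = se_kernel(x, x) + 1e-10 * np.eye(len(x))     # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=4)   # four prior samples
```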


Posterior: Noise-Free Observations (1)

Given noise-free training data:
\[
\mathcal{D} = \{ (x^{(i)}, f^{(i)}) \mid i = 1, \ldots, n \} = \{X, f\}.
\]
We want to make predictions $f^*$ at test points $X^*$.
According to the GP prior, the joint distribution of $f$ and $f^*$ is:
\[
\begin{bmatrix} f \\ f^* \end{bmatrix}
\sim \mathcal{N}\!\left( 0,
\begin{bmatrix}
K(X, X) & K(X, X^*) \\
K(X^*, X) & K(X^*, X^*)
\end{bmatrix}
\right).
\]


Posterior: Noise-Free Observations (2)

Condition $\{X^*, f^*\}$ on $\mathcal{D} = \{X, f\}$ to obtain the posterior.

This restricts the prior to contain only functions which agree with $\mathcal{D}$.
The posterior, $P(f^* \mid X^*, X, f)$, is Gaussian with:
\[
\mu = K(X^*, X)\, K(X, X)^{-1} f, \quad \text{and}
\]
\[
\Sigma = K(X^*, X^*) - K(X^*, X)\, K(X, X)^{-1} K(X, X^*).
\]
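A minimal sketch of this conditioning step, reusing the squared exponential covariance; the jitter term and the names are illustrative assumptions:

```python
import numpy as np

def se_kernel(xp, xq):
    return np.exp(-0.5 * np.subtract.outer(xp, xq) ** 2)

def gp_posterior_noise_free(X, f, X_star):
    """Posterior mean and covariance of f* given noise-free observations (X, f)."""
    K_xx = se_kernel(X, X)
    K_xs = se_kernel(X, X_star)                            # K(X, X*)
    K_ss = se_kernel(X_star, X_star)
    K_inv = np.linalg.inv(K_xx + 1e-10 * np.eye(len(X)))   # jitter (assumption)
    mu = K_xs.T @ K_inv @ f                                # K(X*, X) K(X, X)^{-1} f
    Sigma = K_ss - K_xs.T @ K_inv @ K_xs                   # K(X*,X*) - K(X*,X) K(X,X)^{-1} K(X,X*)
    return mu, Sigma
```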


Posterior: Noise-Free Observations (3)

[Figure: samples from the posterior; axes: input x vs. output f(x)]

Samples all agree with the observations $\mathcal{D} = \{X, f\}$.


Greatest variance is in regions with few training points.


Prediction: Noisy Observations

Typically we have noisy observations:
\[
\mathcal{D} = \{X, y\}, \quad \text{where } y = f + \epsilon.
\]
Assume additive noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Conditioning on $\mathcal{D} = \{X, y\}$ gives a Gaussian with:
\[
\mu = K(X^*, X)\, [K(X, X) + \sigma^2 I]^{-1} y, \quad \text{and}
\]
\[
\Sigma = K(X^*, X^*) - K(X^*, X)\, [K(X, X) + \sigma^2 I]^{-1} K(X, X^*).
\]
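The same sketch with the noise term added to the training covariance; the noise level $\sigma^2$ and the names are assumptions:

```python
import numpy as np

def se_kernel(xp, xq):
    return np.exp(-0.5 * np.subtract.outer(xp, xq) ** 2)

def gp_predict(X, y, X_star, sigma2=0.01):
    """Predictive mean and covariance of f* given noisy observations (X, y)."""
    K_y = se_kernel(X, X) + sigma2 * np.eye(len(X))   # K(X, X) + sigma^2 I
    K_xs = se_kernel(X, X_star)
    K_ss = se_kernel(X_star, X_star)
    K_y_inv = np.linalg.inv(K_y)
    mu = K_xs.T @ K_y_inv @ y
    Sigma = K_ss - K_xs.T @ K_y_inv @ K_xs
    return mu, Sigma
```

In practice one would solve with a Cholesky factorization rather than forming the explicit inverse, but the inverse keeps the sketch close to the formulas above.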


Model Selection: Hyperparameters


e.g. the ARD covariance function with length-scale hyperparameter $\ell$:
\[
k(x^{(p)}, x^{(q)}) = \exp\!\left( -\frac{1}{2\ell^2} \left( x^{(p)} - x^{(q)} \right)^2 \right).
\]
How best to choose $\ell$?


[Figure: samples from the posterior for length-scales 0.1, 0.3, and 0.5; axes: input x vs. output f(x)]
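To reproduce the qualitative effect in the figure above, the following sketch draws posterior samples for several length-scales; the training data, noise level, and names are illustrative assumptions rather than the values used in the original figure.

```python
import numpy as np

def se_kernel(xp, xq, lengthscale):
    return np.exp(-0.5 * (np.subtract.outer(xp, xq) / lengthscale) ** 2)

rng = np.random.default_rng(0)
X = np.array([0.1, 0.3, 0.5, 0.7, 0.9])                 # training inputs (assumption)
y = np.sin(2 * np.pi * X) + 0.05 * rng.normal(size=X.size)
X_star = np.linspace(0.0, 1.0, 200)
sigma2 = 0.01                                           # noise variance (assumption)

for ell in (0.1, 0.3, 0.5):
    K_y = se_kernel(X, X, ell) + sigma2 * np.eye(X.size)
    K_xs = se_kernel(X, X_star, ell)
    K_ss = se_kernel(X_star, X_star, ell)
    K_y_inv = np.linalg.inv(K_y)
    mu = K_xs.T @ K_y_inv @ y
    Sigma = K_ss - K_xs.T @ K_y_inv @ K_xs
    # small lengthscale -> wiggly posterior samples; large lengthscale -> smooth samples
    samples = rng.multivariate_normal(mu, Sigma + 1e-10 * np.eye(X_star.size), size=3)
```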


Model Selection: Optimizing Marginal Likelihood (1)

In the absence of a strong prior $P(\ell)$, the posterior for the hyperparameter $\ell$ is proportional to the marginal likelihood:
\[
P(\ell \mid X, y) \propto P(y \mid X, \ell).
\]
Choose $\ell$ to optimize the marginal log-likelihood:
\[
\log P(y \mid X, \ell) = -\tfrac{1}{2} \log \left| K(X, X) + \sigma^2 I \right|
- \tfrac{1}{2}\, y^\top \left( K(X, X) + \sigma^2 I \right)^{-1} y
- \tfrac{n}{2} \log 2\pi.
\]
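A direct transcription of this expression, using a Cholesky factorization for numerical stability; the kernel choice and the names are illustrative:

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, sigma2):
    """log P(y | X, lengthscale) for an SE kernel with the given length-scale and noise."""
    n = y.size
    K_y = np.exp(-0.5 * (np.subtract.outer(X, X) / lengthscale) ** 2) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K_y)                          # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K_y)^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |K_y|
    return -0.5 * log_det - 0.5 * y @ alpha - 0.5 * n * np.log(2 * np.pi)
```

The length-scale (and, if desired, the noise variance) can then be chosen by maximizing this quantity over a grid or with a gradient-based optimizer.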


Model Selection: Optimizing Marginal Likelihood (2)

$\ell_{\mathrm{ML}} = 0.3255$:
[Figure: samples from the posterior with the maximum-likelihood length-scale 0.3255; axes: input x vs. output f(x)]

Using $\ell_{\mathrm{ML}}$ is an approximation to the true Bayesian method of integrating over all values of $\ell$ weighted by their posterior.


References

1. Carl Edward Rasmussen. Gaussian Processes in Machine Learning. Machine Learning Summer School, Tübingen, 2003. http://www.kyb.tuebingen.mpg.de/~carl/mlss03/
2. Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. Forthcoming.
3. Carl Edward Rasmussen. The Gaussian Process Website. http://www.gatsby.ucl.ac.uk/~edward/gp/
