
Bayesian wavelet estimators in nonparametric regression
Natalia Bochkina
University of Edinburgh
Outline
Lecture 1. Classical and Bayesian approaches to estimation in nonparametric
regression
1. Classical estimators
Kernel estimators
Orthogonal series estimators
Other estimators (local polynomials, spline estimators etc)
2. Bayesian approach
Prior on coefficients in an orthogonal basis
Gaussian process priors
Other prior distributions
Lecture 2. Classical minimax consistency and concentration of posterior
measures
1. Decision-theoretic approach to classical consistency and concentration of
posterior measures
2. Classical consistency
Bayes and minimax estimators
Speed of convergence
Adaptivity
Lower bounds
3. Concentration of posterior measures
Lecture 3. Wavelet estimators in nonparametric regression
1. Thresholding estimators (universal, SURE, Block thresholding; different types of
thresholding functions).
2. Different choices of prior distributions
3. Empirical Bayes estimators (posterior mean and median, Bayes factor
estimator)
4. Optimal non-adaptive wavelet estimators
5. Optimal adaptive wavelet estimators
Lecture 4. Wavelet estimators: simultaneous local and global optimality
1. Separable and non-separable function estimators
2. When simultaneous local and global optimality is possible
3. Bayesian wavelet estimator that is locally and globally optimal
4. Conclusions and open questions
Lecture 1. Classical and Bayesian approaches to estimation in nonparametric
regression
1. Classical estimators
Kernel estimators
Orthogonal series estimators
Other estimators
2. Bayesian approach
Prior on coefficients in an orthogonal basis
Gaussian process priors
Other prior distributions
Lecture 1. Main references
A. Tsybakov (2009) Introduction to nonparametric estimation. Springer.
J. Ramsay and B. Silverman (2002) Functional data analysis. Springer
B. Vidakovic (1999) Statistical modeling by wavelets. Wiley.
C. Rasmussen & C. Williams (2006) Gaussian processes for machine learning.
MIT Press.
Examples of nonparametric models and problems
Estimation of a probability density
Let $X_1, \ldots, X_n \sim F$ iid, where the distribution $F$ is absolutely continuous with respect
to the Lebesgue measure on $\mathbb{R}$.
Aim: estimate the unknown density $p(x) = \frac{dF}{d\lambda}(x)$, where $\lambda$ denotes the Lebesgue measure.
Nonparametric regression
Assume pairs of random variables $(X_1, Y_1), \ldots, (X_n, Y_n)$ are such that
$$Y_i = f(X_i) + \varepsilon_i, \quad X_i \in [0, 1],$$
where $E(\varepsilon_i) = 0$ for all $i$. We can write $f(x) = E(Y_i \mid X_i = x)$.
The unknown function $f : [0, 1] \to \mathbb{R}$ is called the regression function.
The problem of nonparametric regression is to estimate the unknown function $f$.
We focus on the nonparametric regression problem and on large sample properties.
Examples of nonparametric models and problems (cont.)
White noise model
This is an idealised model that provides an approximation to the nonparametric
regression model. Consider the following stochastic differential equation:
$$dY(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t), \quad t \in [0, 1],$$
where $W$ is a standard Wiener process on $[0, 1]$, the function $f$ is an unknown
function on $[0, 1]$, and $n$ is an integer. It is assumed that a sample path
$\{Y(t),\ 0 \le t \le 1\}$ of the process $Y$ is observed.
The statistical problem is to estimate the unknown function $f$.
First introduced in the context of nonparametric estimation by Ibragimov and
Hasminskii (1977, 1981).
Formally, asymptotic equivalence was proved by Brown and Low (1996).
An extension to the multivariate case and random design regression was obtained
by Reiss (2008).
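To see the connection with the regression model numerically, one can discretise the white noise model on the grid $t_i = i/n$: the rescaled increments $n[Y(t_i) - Y(t_{i-1})]$ behave approximately like $f(t_i) + \xi_i$ with $\xi_i \sim N(0, 1)$. A minimal simulation sketch follows; the choice of $f$ and of $n$ is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = np.arange(1, n + 1) / n
f = np.sin(2 * np.pi * t)                      # illustrative choice of the unknown f

# increments of Y(t) = int_0^t f(s) ds + n^{-1/2} W(t) on the grid t_i = i/n
dW = rng.normal(scale=np.sqrt(1 / n), size=n)  # W(t_i) - W(t_{i-1}) ~ N(0, 1/n)
dY = f / n + dW / np.sqrt(n)

# rescaled increments look like a regression model with iid N(0, 1) noise
y = n * dY                                     # approximately f(t_i) + N(0, 1)
```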
Parametric vs nonparametric estimation
1. Parametric estimation
If we know a priori that the unknown $f$ (regression function or density function) belongs
to a parametric family $\{g(x, \theta) : \theta \in \Theta\}$, where $g(\cdot, \cdot)$ is a given function and
$\Theta \subset \mathbb{R}^k$ ($k$ is fixed, independent of $n$), then estimation of $f$ is equivalent to
estimation of the finite-dimensional parameter $\theta$.
Examples: 1. The density $p$ is normal $N(a, \sigma^2)$, with unknown parameter
$\theta = (a, \sigma^2) \in \Theta = \mathbb{R} \times \mathbb{R}_+$.
2. The regression function $f(x)$ is linear: $f(x) = ax + b$, $\theta = (a, b) \in \Theta = \mathbb{R}^2$.
If such prior information about $f$ is not available, we deal with a nonparametric
problem.
Parametric vs nonparametric estimation
2. Nonparametric estimation
This is an ill-posed problem, hence usually additional prior assumptions on $f$ are used.
Direct assumption: $f$ belongs to some massive class $\mathcal{F}$ of functions. For example,
$\mathcal{F}$ can be the set of all continuous functions on $\mathbb{R}$ or the set of all differentiable
functions on $\mathbb{R}$.
Tuning parameters of the estimators considered below are chosen to achieve the best
performance in the specified class of functions.
Indirect assumptions are also used, e.g. via penalisation or via a prior distribution on $f$ in
the Bayesian approach.
Nonparametric regression estimators
1. Kernel estimators
Density estimation: $X_1, \ldots, X_n$ are iid random variables with (unknown) density
$p(x)$ with respect to the Lebesgue measure on $\mathbb{R}$.
The corresponding distribution function is $F(x) = \int_{-\infty}^{x} p(t)\,dt$.
The empirical distribution function is
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$
where $I(A)$ denotes the indicator function of the set $A$. By the strong law of large
numbers, we have
$$\hat{F}_n(x) \to F(x), \quad \forall x \in \mathbb{R},$$
almost surely as $n \to \infty$. Therefore, $\hat{F}_n(x)$ is a consistent estimator of $F(x)$ for
every $x \in \mathbb{R}$. How can we estimate the density $p$?
Kernel density estimators (cont.)
One of the first intuitive solutions is based on the following argument. For sufficiently
small $h > 0$ we can write the approximation
$$p(x) = F'(x) \approx \frac{F(x + h) - F(x - h)}{2h}.$$
Replacing $F$ by $\hat{F}_n$, we define
$$\hat{p}_n^R(x) = \frac{\hat{F}_n(x + h) - \hat{F}_n(x - h)}{2h},$$
which is called the Rosenblatt estimator. It can be rewritten in the form
$$\hat{p}_n^R(x) = \frac{1}{2nh} \sum_{i=1}^{n} I(x - h < X_i \le x + h) = \frac{1}{nh} \sum_{i=1}^{n} K_0\left(\frac{X_i - x}{h}\right),$$
where $K_0(u) = \frac{1}{2} I(-1 < u \le 1)$.
Kernel density estimators (cont.)
A simple generalisation of the Rosenblatt estimator is given by
$$\hat{p}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right),$$
where $K : \mathbb{R} \to \mathbb{R}$ is an integrable function satisfying $\int K(u)\,du = 1$. Such a
function $K$ is called a kernel and the parameter $h$ is called the bandwidth of the
estimator $\hat{p}_n(x)$. The function $\hat{p}_n(x)$ is called the kernel density estimator or the
Parzen-Rosenblatt estimator.
Further reading: B. Silverman (1986) Density estimation for statistics and data
analysis. Chapman & Hall.
Tuning parameters: bandwidth $h$ and kernel $K$.
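To make the construction concrete, here is a minimal numpy sketch of the Parzen-Rosenblatt estimator. The Gaussian kernel (listed later in the lecture), the simulated sample and the bandwidth value are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kernel_density_estimate(x_grid, X, h, K=None):
    """Parzen-Rosenblatt estimator p_n(x) = (1/(n h)) * sum_i K((X_i - x)/h)."""
    if K is None:
        # Gaussian kernel as the default choice
        K = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    X = np.asarray(X)
    n = X.size
    # broadcast: rows index grid points x, columns index observations X_i
    u = (X[None, :] - np.asarray(x_grid)[:, None]) / h
    return K(u).sum(axis=1) / (n * h)

# usage on simulated data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=200)
x_grid = np.linspace(-3, 3, 101)
p_hat = kernel_density_estimate(x_grid, X, h=0.4)
```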
Kernel estimators for the regression function
Nonparametric regression model:
$$Y_i = f(X_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$
where $(X_i, Y_i)$ are iid pairs, $E|Y_i| < \infty$, and $f(x) = E(Y_i \mid X_i = x)$ is the regression
function to be estimated.
Given a kernel $K$ and a bandwidth $h$, one can construct various kernel estimators
for nonparametric regression similar to those for density estimation. The most
celebrated one is the Nadaraya-Watson estimator.
Motivation for the Nadaraya-Watson estimator
Suppose $(X, Y)$ has density $p(x, y)$ with respect to the Lebesgue measure and
$p(x) = \int p(x, y)\,dy > 0$. Then
$$f(x) = E(Y \mid X = x) = \int y\, p(y \mid x)\,dy = \frac{\int y\, p(x, y)\,dy}{p(x)}.$$
If we replace here $p(x, y)$ by its kernel estimator $\hat{p}_n(x, y)$:
$$\hat{p}_n(x, y) = \frac{1}{nh^2} \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right) K\left(\frac{Y_i - y}{h}\right),$$
and use the kernel estimator $\hat{p}_n(x)$ instead of $p(x)$, then, if the kernel $K$ is of order 1
(i.e. $\int u K(u)\,du = 0$), we obtain the Nadaraya-Watson estimator:
$$\hat{f}_n^{NW}(x) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right)}, \quad \text{if } \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right) \ne 0,$$
and $\hat{f}_n^{NW}(x) = 0$ otherwise.
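A short numpy sketch of the Nadaraya-Watson estimator follows; the Gaussian kernel and the zero fallback when all kernel weights vanish mirror the definition above, while the simulated data and the bandwidth are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """Nadaraya-Watson estimator: weighted average of Y_i with kernel weights."""
    X, Y = np.asarray(X), np.asarray(Y)
    K = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)    # Gaussian kernel
    w = K((X[None, :] - np.asarray(x_grid)[:, None]) / h)   # shape (len(x_grid), n)
    denom = w.sum(axis=1)
    f_hat = np.zeros_like(denom)
    nz = denom != 0
    f_hat[nz] = (w[nz] @ Y) / denom[nz]   # set to 0 where the denominator vanishes
    return f_hat

# usage on simulated data (illustrative)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=200)
f_hat = nadaraya_watson(np.linspace(0, 1, 101), X, Y, h=0.08)
```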
The Nadaraya-Watson estimator is linear
The Nadaraya-Watson estimator can be represented as a weighted sum of the $Y_i$:
$$\hat{f}_n^{NW}(x) = \sum_{i=1}^{n} Y_i W_{ni}^{NW}(x),$$
where the weights are given by
$$W_{ni}^{NW}(x) = \frac{K\left(\frac{X_i - x}{h}\right)}{\sum_{j=1}^{n} K\left(\frac{X_j - x}{h}\right)}\, I\left(\sum_{j=1}^{n} K\left(\frac{X_j - x}{h}\right) \ne 0\right).$$
Definition 1. An estimator $\hat{f}_n(x)$ of $f(x)$ is called a linear nonparametric
regression estimator if it can be written in the form
$$\hat{f}_n(x) = \sum_{i=1}^{n} Y_i W_{ni}(x),$$
where the weights $W_{ni}(x) = W_{ni}(x, X_1, \ldots, X_n)$ depend only on $n$, $i$, $x$ and
the values $X_1, \ldots, X_n$.
Typically, $\sum_{i=1}^{n} W_{ni}(x) = 1$ for all $x$ (or for almost all $x$ with respect to the Lebesgue measure).
Nadaraya-Watson estimator (cont.)
If the density $p(x)$ of the $X_i$ is known, we can use it instead of $\hat{p}_n(x)$; we then obtain a
different kernel estimator:
$$\hat{f}_n(x) = \frac{1}{nh\,p(x)} \sum_{i=1}^{n} Y_i K\left(\frac{X_i - x}{h}\right),$$
and, in the case of uniform design ($X_i \sim U[0, 1]$),
$$\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} Y_i K\left(\frac{X_i - x}{h}\right).$$
This estimator is also applicable to the regular fixed design $x_i = i/n$.
Other kernels
$K(u) = (1 - |u|)\, I(|u| \le 1)$ - triangular kernel
$K(u) = \frac{3}{4}(1 - u^2)\, I(|u| \le 1)$ - parabolic, or Epanechnikov, kernel
$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$ - Gaussian kernel
$K(u) = \frac{1}{2} e^{-|u|/\sqrt{2}} \sin(|u|/\sqrt{2} + \pi/4)$ - Silverman kernel
Local polynomial estimators
If the kernel $K$ takes only nonnegative values, the Nadaraya-Watson estimator
$\hat{f}_n^{NW}$ satisfies
$$\hat{f}_n^{NW}(x) = \arg\min_{\theta \in \mathbb{R}} \sum_{i=1}^{n} (Y_i - \theta)^2 K\left(\frac{X_i - x}{h}\right).$$
Thus $\hat{f}_n^{NW}$ is obtained by a local constant least squares approximation of the
outputs $Y_i$.
Local polynomial least squares approximation: replace the constant by a polynomial
of given degree $k$. If the derivative $f^{(k)}$ exists, then for $z$ sufficiently close to $x$ we may write
$$f(z) \approx f(x) + f'(x)(z - x) + \ldots + \frac{f^{(k)}(x)}{k!}(z - x)^k = \theta^T(x)\, U\left(\frac{z - x}{h}\right),$$
where
$$U(u) = \left(1, u, u^2/2!, \ldots, u^k/k!\right)^T, \quad \theta(x) = \left(f(x), f'(x)h, f''(x)h^2, \ldots, f^{(k)}(x)h^k\right)^T.$$
Local polynomial estimators
Definition 2. Let $K : \mathbb{R} \to \mathbb{R}$ be a kernel, $h > 0$ a bandwidth, and $k > 0$ an
integer. The vector $\hat{\theta}_n(x) \in \mathbb{R}^{k+1}$ defined by
$$\hat{\theta}_n(x) = \arg\min_{\theta \in \mathbb{R}^{k+1}} \sum_{i=1}^{n} \left(Y_i - \theta^T U\left(\frac{X_i - x}{h}\right)\right)^2 K\left(\frac{X_i - x}{h}\right)$$
is called a local polynomial estimator of order $k$ of $\theta(x)$. The statistic
$$\hat{f}_n(x) = U^T(0)\, \hat{\theta}_n(x)$$
is called a local polynomial estimator of order $k$ of $f(x)$.
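A minimal numpy sketch of this LP($k$) estimator: it solves the weighted least squares problem at each grid point and returns $U^T(0)\hat{\theta}_n(x)$, i.e. the first coordinate of $\hat{\theta}_n(x)$. The Epanechnikov kernel and the use of a pseudo-inverse for stability are assumptions of this sketch.

```python
import math
import numpy as np

def local_polynomial(x_grid, X, Y, h, k=1):
    """LP(k) estimator: weighted polynomial fit around each x, return the constant term."""
    X, Y = np.asarray(X), np.asarray(Y)
    epan = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)    # Epanechnikov kernel
    f_hat = np.empty(len(x_grid))
    for m, x in enumerate(x_grid):
        u = (X - x) / h
        w = epan(u)
        # design matrix with columns u^j / j!, j = 0..k  (the vector U((X_i - x)/h))
        U = np.column_stack([u**j / math.factorial(j) for j in range(k + 1)])
        W = np.diag(w)
        # weighted least squares: theta_hat = (U^T W U)^{-1} U^T W Y
        theta_hat = np.linalg.pinv(U.T @ W @ U) @ (U.T @ W @ Y)
        f_hat[m] = theta_hat[0]          # U(0)^T theta_hat = first component
    return f_hat
```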
2. Projection estimators (orthogonal series estimators)
Nonparametric regression model:
$$Y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$
where $E\varepsilon_i = 0$, $E\varepsilon_i^2 < \infty$.
Assume $x_i = i/n$ and $f \in L_2[0, 1]$.
Take some orthonormal basis $\{\varphi_k(x)\}_{k=0}^{\infty}$ of $L_2[0, 1]$. Then, for any
$f \in L_2[0, 1]$, there exist coefficients $\{\theta_k\}_{k=0}^{\infty}$ such that
$$f(x) = \sum_{k=0}^{\infty} \theta_k \varphi_k(x), \quad \text{where } \theta_k = \int_0^1 f(x)\varphi_k(x)\,dx.$$
Projection estimation of $f$ is based on a simple idea: approximate $f$ by its projection
$\sum_{k=0}^{N} \theta_k \varphi_k(x)$ on the linear span of the first $N + 1$ functions of the basis, and
replace the $\theta_k$ by their estimators.
Projection estimators
If the $X_i$ are scattered over $[0, 1]$ in a sufficiently uniform way, which happens, e.g., in
the case $X_i = i/n$, the coefficients $\theta_k$ are well approximated by the sums
$\frac{1}{n} \sum_{i=1}^{n} f(X_i)\varphi_k(X_i)$.
Replacing in these sums the unknown quantities $f(X_i)$ by the observations $Y_i$, we
obtain the following estimators of $\theta_k$:
$$\hat{\theta}_k = \frac{1}{n} \sum_{i=1}^{n} Y_i \varphi_k(X_i).$$
Definition 3. Let $N \ge 1$ be an integer. The statistic
$$\hat{f}_n^N(x) = \sum_{k=0}^{N} \hat{\theta}_k \varphi_k(x)$$
is called a projection estimator (or an orthogonal series estimator) of the regression
function $f$ at the point $x$.
Choice of the parameter $N$ corresponds to choosing the smoothness of $f$.
Projection estimators (cont.)
Note that $\hat{f}_n^N(x)$ is a linear estimator, since we may write it in the form
$$\hat{f}_n^N(x) = \sum_{i=1}^{n} Y_i W_{ni}(x) \quad \text{with} \quad W_{ni}(x) = \frac{1}{n} \sum_{k=0}^{N} \varphi_k(x)\varphi_k(X_i).$$
Examples:
1. Fourier basis: $\varphi_1(x) = 1$, $\varphi_{2k}(x) = \sqrt{2}\cos(2\pi k x)$,
$\varphi_{2k+1}(x) = \sqrt{2}\sin(2\pi k x)$, $k = 1, 2, \ldots$, $x \in [0, 1]$ (Tsybakov, 2009).
2. A wavelet basis (Vidakovic, 1999).
3. An orthogonal polynomial basis: $\varphi_k(x) = (x - a)^k$, $k \ge 0$ (more commonly
used in the context of density estimation).
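A compact numpy sketch of the projection estimator with the trigonometric basis of Example 1, for the regular design $x_i = i/n$; the regression function used to simulate the data and the choice $N = 7$ are illustrative assumptions.

```python
import numpy as np

def trig_basis(x, N):
    """Columns: 1, sqrt(2)cos(2*pi*x), sqrt(2)sin(2*pi*x), sqrt(2)cos(4*pi*x), ... (N+1 columns)."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    j = 1
    while len(cols) < N + 1:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        if len(cols) < N + 1:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
        j += 1
    return np.column_stack(cols)

def projection_estimator(x_grid, X, Y, N):
    """theta_hat_k = (1/n) sum_i Y_i phi_k(X_i);  f_hat(x) = sum_k theta_hat_k phi_k(x)."""
    theta_hat = trig_basis(X, N).T @ np.asarray(Y) / len(Y)
    return trig_basis(x_grid, N) @ theta_hat

# usage (illustrative): regular design x_i = i/n
n = 256
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + 0.3 * np.random.default_rng(2).normal(size=n)
f_hat = projection_estimator(np.linspace(0, 1, 101), x, y, N=7)
```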
Generalisation to arbitrary $X_i$'s
Define the vectors $\theta = (\theta_0, \ldots, \theta_N)^T$ and $\varphi(x) = (\varphi_0(x), \ldots, \varphi_N(x))^T$.
The least squares estimator $\hat{\theta}^{LS}$ of the vector $\theta$ is defined as follows:
$$\hat{\theta}^{LS} = \arg\min_{\theta \in \mathbb{R}^{N+1}} \sum_{i=1}^{n} \left(Y_i - \theta^T \varphi(X_i)\right)^2.$$
If the matrix
$$B = \sum_{i=1}^{n} \varphi(X_i)\varphi^T(X_i)$$
is invertible, we can write
$$\hat{\theta}^{LS} = B^{-1} \sum_{i=1}^{n} Y_i \varphi(X_i).$$
Then the nonparametric least squares estimator of $f(x)$ is given by
$$\hat{f}_{n,N}^{LS}(x) = \varphi^T(x)\, \hat{\theta}^{LS}.$$
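A short numpy sketch of this least squares series estimator for an arbitrary (e.g. irregular random) design, using a cosine basis of $L_2[0,1]$ for brevity; the design, noise level and $N$ are illustrative assumptions.

```python
import numpy as np

def cosine_basis(x, N):
    """Orthonormal cosine basis of L2[0,1]: phi_0 = 1, phi_k(x) = sqrt(2) cos(pi k x)."""
    x = np.asarray(x, dtype=float)
    Phi = np.ones((len(x), N + 1))
    for k in range(1, N + 1):
        Phi[:, k] = np.sqrt(2) * np.cos(np.pi * k * x)
    return Phi

def ls_series_estimator(x_grid, X, Y, N):
    """theta_LS = B^{-1} sum_i Y_i phi(X_i), with B = sum_i phi(X_i) phi(X_i)^T."""
    Phi = cosine_basis(X, N)                      # rows are phi(X_i)^T
    B = Phi.T @ Phi
    theta_ls = np.linalg.solve(B, Phi.T @ np.asarray(Y))
    return cosine_basis(x_grid, N) @ theta_ls

# usage (illustrative): irregular random design
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=200)
f_hat = ls_series_estimator(np.linspace(0, 1, 101), X, Y, N=7)
```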
Wavelet basis
A wavelet basis with periodic boundary correction on $[0, 1]$ is
$$\{\phi_{Lk},\ k = 0, \ldots, 2^L - 1;\ \psi_{jk},\ j = L, L + 1, \ldots,\ k = 0, \ldots, 2^j - 1\},$$
where
$$\phi_{jk}(x) = 2^{j/2}\phi(2^j x - k), \quad \psi_{jk}(x) = 2^{j/2}\psi(2^j x - k),$$
$\phi(x)$ is a scaling function and $\psi(x)$ is a wavelet function such that
$$\int \phi(x)\,dx = 1, \quad \int \psi(x)\,dx = 0.$$
Then any $f \in L_2[0, 1]$ can be decomposed in the wavelet basis:
$$f(x) = \sum_{k=0}^{2^L - 1} \theta_k \phi_{Lk}(x) + \sum_{j=L}^{\infty} \sum_{k=0}^{2^j - 1} \theta_{jk} \psi_{jk}(x),$$
and $\theta = \{\theta_k, \theta_{jk}\}$ is the set of wavelet coefficients. [Meyer, 1990]
Wavelets $(\phi, \psi)$ are said to have regularity $s$ if they have $s$ derivatives and $\psi$ has $s$
vanishing moments ($\int x^k \psi(x)\,dx = 0$ for integer $k < s$).
Examples of wavelet functions
[Figure: (a) Haar mother wavelet; (b) Daubechies wavelet (extremal phase, N = 2), s = 2; (c) Daubechies mother wavelet, s = 4.]
Localisation in time and frequency domains - sparse wavelet representation of most
functions.
Daubechies wavelet transform, s = 8
[Figure: panels of Daubechies compactly supported wavelets (extremal phase, N = 8), shown at several scales and locations.]
Discrete wavelet transform (DWT)
Applying the discretised wavelet transform to the data yields
$$d_{jk} = w_{jk} + \tilde{\varepsilon}_{jk}, \quad L \le j \le J - 1, \quad k = 0, \ldots, 2^j - 1,$$
$$c_{Lk} = u_{Lk} + \tilde{\varepsilon}_k, \quad k = 0, \ldots, 2^L - 1,$$
where $d_{jk}$ and $c_{Lk}$ are the discrete wavelet and scaling coefficients of the observations
$(y_i)$, and $\tilde{\varepsilon}_{jk}$ and $\tilde{\varepsilon}_k$ are the coefficients of the discrete wavelet transform of the noise
$(\varepsilon_i)$. If the $\varepsilon_i \sim N(0, \sigma^2)$ are independent, then the $\tilde{\varepsilon}_{jk} \sim N(0, \sigma^2)$ are independent.
Connection to $\theta_{jk}$:
$$\theta_{jk} = \int_0^1 f(x)\psi_{jk}(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} \psi_{jk}(i/n)f(i/n) = \frac{1}{\sqrt{n}}(Wf_n)_{(jk)} = \frac{w_{jk}}{\sqrt{n}} =: \tilde{\theta}_{jk},$$
where $W$ is an orthonormal $n \times n$ matrix and $f_n = (f(1/n), \ldots, f(1))^T$.
Also, for $y_{jk} = d_{jk}/\sqrt{n}$ and $y_k = c_{Lk}/\sqrt{n}$, and for Gaussian noise,
$$y_{jk} \sim N(\tilde{\theta}_{jk}, \sigma^2/n), \quad y_k \sim N(\tilde{\theta}_k, \sigma^2/n).$$
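As an illustration of the DWT mapping data to scaling and wavelet coefficients, here is a self-contained orthonormal Haar transform in numpy (a library such as PyWavelets could be used instead); the power-of-two sample size, the Haar filter and the example signal are assumptions of this sketch.

```python
import numpy as np

def haar_dwt(y, L=0):
    """Orthonormal Haar DWT: returns scaling coefficients c at level L and a list of
    detail (wavelet) coefficient arrays, ordered from coarsest to finest level."""
    c = np.asarray(y, dtype=float)
    n = c.size
    assert n & (n - 1) == 0, "length must be a power of two"
    details = []
    while c.size > 2**L:
        pairs = c.reshape(-1, 2)
        d = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # detail coefficients at this level
        c = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # scaling coefficients, next coarser level
        details.append(d)
    return c, details[::-1]

# usage (illustrative): noisy samples of a step function on a regular grid, n = 2^J
n = 1024
x = np.arange(1, n + 1) / n
y = np.where(x > 0.5, 1.0, 0.0) + 0.1 * np.random.default_rng(3).normal(size=n)
c, d = haar_dwt(y, L=3)   # c has 2^3 entries; d holds levels j = 3, ..., 9
```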
Smoothness
Fourier series - basis of Sobolev spaces $W_p^r \subset L_2$, $p \in [1, \infty]$, $r > 0$:
$$f \in W_p^r \iff \sum_{k=1}^{\infty} |a_k^r\, \theta_k|^p < \infty,$$
where $a_k = k$ for even $k$ and $a_k = k - 1$ for odd $k$.
Wavelet series - basis of Besov spaces $B_{p,q}^r \subset L_2$, $p, q \in [1, \infty]$, $r > 0$:
$$f \in B_{p,q}^r \iff \left(\sum_{k=0}^{2^L-1} |\theta_k|^p\right)^{1/p} + \left(\sum_{j=L}^{\infty} 2^{jq(r+1/2-1/p)} \left(\sum_{k=0}^{2^j-1} |\theta_{jk}|^p\right)^{q/p}\right)^{1/q} < \infty,$$
provided the regularity $s$ of the wavelet transform satisfies $s > r > 0$ (Donoho and Johnstone,
1998, Theorem 2).
Embeddings: $B_{2,2}^r = W_2^r$.
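A small sketch computing the Besov sequence norm displayed above from wavelet coefficients (as produced, e.g., by the Haar DWT sketch earlier); finite $p$ and $q$ and the inevitable truncation at the finest available level are assumptions of this sketch.

```python
import numpy as np

def besov_sequence_norm(theta_scaling, theta_detail, r, p, q, L):
    """||theta||_{B^r_{p,q}} = ||scaling||_p + ( sum_j 2^{jq(r+1/2-1/p)} ||theta_j.||_p^q )^{1/q}.
    Assumes finite p, q; theta_detail[m] holds the coefficients of level j = L + m."""
    scaling_part = np.sum(np.abs(theta_scaling)**p)**(1 / p)
    detail_part = 0.0
    for m, theta_j in enumerate(theta_detail):
        j = L + m
        level_lp = np.sum(np.abs(theta_j)**p)**(1 / p)
        detail_part += (2**(j * (r + 0.5 - 1 / p)) * level_lp)**q
    return scaling_part + detail_part**(1 / q)
```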
Regularisation
Penalised least squares estimator of $f$:
$$\hat{f}_n^{pen} = \arg\min_{f \in \mathcal{F}} \left\{ \sum_{i=1}^{n} (Y_i - f(x_i))^2 + \lambda\, \mathrm{pen}(f) \right\},$$
where $\mathrm{pen}(f)$ is a penalty function and $\lambda > 0$ is the regularisation parameter.
Example: $\mathrm{pen}(f) = \int [f''(x)]^2\,dx$ leads to a cubic spline estimator (Silverman,
1985).
(See Green and Silverman, 1994, for more details.)
Regularisation
Penalisation can be done on the coefficients of $f$ in an orthonormal basis:
$$\hat{\theta}_n^{pen} = \arg\min_{\theta \in \mathbb{R}^{N+1}} \left\{ \sum_{k=0}^{N} (y_k - \theta_k)^2 + \lambda\, \mathrm{pen}(\theta) \right\}.$$
Examples: 1. $\mathrm{pen}(\theta) = ||\theta||_2^2$: $\hat{\theta}_k = \frac{1}{1+\lambda}\, y_k$ - Tikhonov regularisation, ridge
regression.
2. $\mathrm{pen}(\theta) = ||\theta||_1$: for large enough $\lambda$, $\hat{\theta}$ is sparse - lasso regression (Tibshirani,
1996).
The estimator $\hat{f}_n^{pen}(\hat{\theta}_n^{pen})$ coincides with a MAP (maximum a posteriori) Bayesian
estimator.
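Because the basis is orthonormal, both penalties have coordinatewise closed forms: ridge shrinks each $y_k$ by the factor $1/(1+\lambda)$, and the $\ell_1$ penalty soft-thresholds each $y_k$ at $\lambda/2$. The sketch below simply applies these formulas; the coefficient values are illustrative.

```python
import numpy as np

def ridge_coefficients(y, lam):
    """argmin_theta sum (y_k - theta_k)^2 + lam * ||theta||_2^2  ->  y_k / (1 + lam)."""
    return np.asarray(y) / (1.0 + lam)

def lasso_coefficients(y, lam):
    """argmin_theta sum (y_k - theta_k)^2 + lam * ||theta||_1  ->  soft threshold at lam/2."""
    y = np.asarray(y)
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([2.0, -0.3, 0.05, 1.1])
print(ridge_coefficients(y, lam=1.0))   # every coefficient shrunk by factor 1/2
print(lasso_coefficients(y, lam=1.0))   # small coefficients set exactly to zero
```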
Bayesian estimators
Likelihood: $Y_i = f(X_i) + \varepsilon_i$.
Common ways of specifying a prior distribution on a set of functions $\mathcal{F}$:
On the coefficients in some (orthonormal) basis, e.g. a wavelet basis.
Directly on $\mathcal{F}$, e.g. in terms of Gaussian processes.
Inference is based on the posterior distribution of $f$ given $Y$:
$$p(f \mid Y) = \frac{p(Y \mid f)\, p(f)}{p(Y)}.$$
A point summary of the posterior distribution gives $\hat{f}$ (e.g. posterior mean, median,
mode); one can also obtain credibility bands for $\hat{f}$.
Bayesian projection estimators
Decomposition in some orthonormal basis:
$$f(x) = \sum_{k=0}^{\infty} \theta_k \varphi_k(x).$$
Likelihood under the (continuous time) white noise model:
$$Y_k \sim N(\theta_k, \sigma^2/n), \text{ independent.}$$
Under the nonparametric regression model: $y_k \sim N(\tilde{\theta}_k, \sigma^2/n)$, independent.
Prior on the coefficients $\theta$:
$$\theta_k \sim p_k(\theta), \quad k = 0, \ldots, N,$$
and $P(\theta_k = 0) = 1$ for $k > N$.
The prior distributions $p_k$ can be determined by an a priori smoothness assumption.
Inference is based on the posterior distribution $\theta \mid y$: $\hat{\theta}_k$ can be the posterior mean,
median, mode, etc.; one can also assess the variability of $\theta_k$.
Example: posterior mode (MAP) estimator
Suppose we have a Gaussian likelihood, $y_k \sim N(\theta_k, \sigma^2/n)$, and prior densities
$\theta_k \sim p_k(\theta)$, $k = 0, \ldots, N$.
The corresponding posterior density of $\theta$ is
$$f(\theta \mid y) \propto \exp\left\{ \sum_{k=0}^{N} \left[ -\frac{n}{2\sigma^2}(y_k - \theta_k)^2 + \log p_k(\theta_k) \right] \right\}.$$
Posterior mode (MAP) estimator:
$$\hat{\theta}_n^{MAP} = \arg\max_{\theta \in \mathbb{R}^{N+1}} f(\theta \mid y) = \arg\min_{\theta \in \mathbb{R}^{N+1}} \left\{ \sum_{k=0}^{N} (y_k - \theta_k)^2 + \frac{2\sigma^2}{n}\, \mathrm{pen}(\theta) \right\},$$
where $\mathrm{pen}(\theta) = -\sum_{k=0}^{N} \log p_k(\theta_k)$.
For example, a Gaussian prior $\theta_k \sim N(0, \tau^2)$ iid gives $\mathrm{pen}(\theta) = ||\theta||_2^2/(2\tau^2)$, which
corresponds to the ridge regression estimator, and a double exponential prior
$p_k(\theta_k) = \frac{\lambda}{2} e^{-\lambda|\theta_k|}$ iid gives $\mathrm{pen}(\theta) = \lambda||\theta||_1$, which corresponds to lasso regression.
Choice of prior distribution for Bayesian wavelet estimators
Wavelet decomposition:
$$f(x) = \sum_{k=0}^{2^L - 1} \theta_k \phi_{Lk}(x) + \sum_{j=L}^{\infty} \sum_{k=0}^{2^j - 1} \theta_{jk} \psi_{jk}(x).$$
The wavelet representation of most functions is sparse, motivating the following prior
distribution for the wavelet coefficients:
$$\theta_{jk} \sim (1 - \pi_j)\,\delta_0(\theta) + \pi_j\, h_j(\theta),$$
where $h_j(\theta)$ is the prior density of the non-zero wavelet coefficients and
$\pi_j = P(\theta_{jk} \ne 0)$.
Scaling coefficients: $p(\theta_k) \propto 1$ - noninformative prior.
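For a Gaussian slab $h_j = N(0, \tau_j^2)$ (one of the concrete choices discussed on the following slides) and Gaussian noise $y_{jk} \sim N(\theta_{jk}, \sigma^2/n)$, the posterior mean under this spike-and-slab prior has a closed form; the sketch below implements it, with illustrative numerical values.

```python
import numpy as np
from scipy.stats import norm

def spike_slab_posterior_mean(y, sigma2_n, tau2, pi):
    """Posterior mean of theta under the prior (1 - pi) delta_0 + pi N(0, tau2),
    with likelihood y ~ N(theta, sigma2_n), where sigma2_n = sigma^2 / n."""
    y = np.asarray(y, dtype=float)
    # marginal densities of y under the 'zero' and 'non-zero' components
    m0 = norm.pdf(y, scale=np.sqrt(sigma2_n))
    m1 = norm.pdf(y, scale=np.sqrt(sigma2_n + tau2))
    post_nonzero = pi * m1 / (pi * m1 + (1 - pi) * m0)   # P(theta != 0 | y)
    shrink = tau2 / (tau2 + sigma2_n)                    # E(theta | y, theta != 0) = shrink * y
    return post_nonzero * shrink * y

# usage (illustrative): strong shrinkage of small coefficients, little shrinkage of large ones
y = np.array([0.02, 0.05, 0.5, 2.0])
print(spike_slab_posterior_mean(y, sigma2_n=0.01, tau2=1.0, pi=0.2))
```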
Prior distribution of wavelet coefficients
$h$ - normal: Clyde and George (1998), Abramovich, Sapatinas and Silverman
(1998), etc.
$h$ - double exponential: $h(x) = \frac{1}{2} e^{-|x|}$ - Vidakovic (1998), Clyde and George
(1998), Johnstone and Silverman (2005).
$h$ - t distribution: Bochkina and Sapatinas (2005), Johnstone and Silverman (2005).
What is the corresponding a priori regularity of $f$?
A priori regularity
Studied by Abramovich et al. (1998) for normal $h$ with
$$\pi_j = \min(1, c_{\pi} 2^{-\beta j}), \quad \tau_j^2 = c_{\tau} 2^{-\alpha j}, \quad \alpha, \beta \ge 0, \quad c_{\pi}, c_{\tau} > 0,$$
where $\tau_j^2$ is the variance of the normal slab $h_j$ and $\pi_j$ the mixing weight.
Generalised to arbitrary $h$, $\pi_j$ and $\tau_j$ by Bochkina (2002)
[PhD thesis, University of Bristol].
Example: $\tau_j^2 = c_{\tau} 2^{-\alpha j}$, $\pi_j = \min(1, c_{\pi} 2^{-\beta j})$.
The expected number of non-zero wavelet coefficients is $EN = \sum_{j=j_0}^{\infty} 2^j \pi_j$.
We can specify $\pi_j$ in such a way that:
$EN = \infty$: $\pi_j = \min(1, C_{\pi} 2^{-\beta j})$ with $\beta \le 1$;
$EN < \infty$: $\pi_j = \min(1, C_{\pi} 2^{-\beta j})$ with $\beta > 1$.
We consider the case $\beta \in (0, 1]$.
Assumptions on the distribution H
Suppose $\xi$ has distribution $H$, with density $h$.
1. $0 < \beta < 1$, $1 \le p < \infty$, $1 \le q \le \infty$: assume that $E|\xi|^p < \infty$. If $q < \infty$,
we also assume that $E|\xi|^q < \infty$.
2. $0 < \beta < 1$, $p = \infty$, $1 \le q \le \infty$: assume that the distribution of $|\xi|$ has a tail of
one of the following types:
(a) $1 - H(x) + H(-x) = c_l x^{-l}[1 + o(1)]$ as $x \to +\infty$, $l > 0$, $c_l > 0$; if
$q < \infty$, assume that $l > q$;
(b) $1 - H(x) + H(-x) = c_m e^{-(\gamma x)^m}[1 + o(1)]$ as $x \to +\infty$, $m > 0$,
$\gamma > 0$, $c_m > 0$.
3. $\beta = 1$, $1 \le p \le \infty$, $1 \le q < \infty$: assume that $E|\xi|^q < \infty$.
4. $\beta = 1$, $1 \le p \le \infty$, $q = \infty$: assume that there exists $\epsilon > 0$ such that
$E[\log(|\xi|)\, I(|\xi| > \epsilon)] < \infty$.
A priori regularity
$$\epsilon_H = \begin{cases} \dfrac{1}{l}, & H \text{ has a polynomial tail and } p = \infty, \\ 0, & \text{otherwise.} \end{cases}$$
Theorem 1. Suppose that $\psi$ and $\phi$ are wavelet and scaling functions of regularity $s$,
with $0 < r < s$. Consider a function $f$ and its wavelet transform under the
assumptions on $H$ above.
Then, for any fixed value of the scaling coefficients $\theta_k$, $f \in B_{p,q}^r$ almost surely if and
only if
either $r + \frac{1}{2} - \frac{\alpha}{2} - \frac{\beta}{p} + \epsilon_H < 0$,
or $r + \frac{1}{2} - \frac{\alpha}{2} - \frac{\beta}{p} = 0$ and $0 < \beta < 1$, $p < \infty$, $q = \infty$.
Nonparametric Bayesian estimators
Assume a fixed design (i.e. $X_i = x_i$ are fixed):
$$Y_i = f(x_i) + \varepsilon_i, \quad x_i \in [0, 1],$$
with $E(\varepsilon_i) = 0$ for all $i$.
Prior distribution: $f \sim G$,
where $G$ is a probability measure on a set of functions.
Nonparametric Bayesian estimators: examples
1. $G = GP(m(x), k(x, y))$ - a Gaussian process with mean function
$m(x) = Ef(x)$ and covariance function $k(x, y) = \mathrm{Cov}(f(x), f(y))$, which is
symmetric and positive definite.
2. Wavelet dictionary (Abramovich, Sapatinas and Silverman (2000), Bochkina (2002)):
model $f$ as
$$f(x) = f_0(x) + f_w(x) = \sum_{i=1}^{M} \beta_i \phi_{\lambda_i}(x) + \sum_{\lambda \in \Lambda} \beta_{\lambda} \psi_{\lambda}(x),$$
where
$$\psi_{\lambda}(x) = a^{1/2}\psi(a(x - b)), \quad \phi_{\lambda}(x) = a^{1/2}\phi(a(x - b)),$$
$\lambda = (a, b) \in [a_0, \infty) \times [0, 1]$, $M < \infty$ and $a_i < a_0$.
Take $\Lambda$ to be a Poisson process on $\mathbb{R}_+ \times [0, 1]$ with intensity $\mu(a, b) \propto a^{-\delta}$,
$\delta > 0$, and $\beta_{\lambda} \mid \lambda \sim H_{\lambda}(\cdot)$ iid.
For Gaussian $H_{\lambda}$, Abramovich et al. (2000) give necessary and sufficient
conditions for $f \in B_{p,q}^r$ with probability 1; for more general $H$ see Bochkina
(2002).
3. Lévy adaptive regression kernels: $f(x) = \int g(x, \omega) L(d\omega)$,
where $L(\cdot)$ is a Lévy random measure:
$$L(A) = \sum_{k=0}^{N} \gamma_k I_A(\omega_k),$$
where $N \sim \mathrm{Pois}(\nu)$ and $(\gamma_k, \omega_k) \sim \pi(d\gamma, d\omega)$ iid (C. Tu, M. Clyde and R. Wolpert,
2007).
Nonparametric Bayesian estimators with a Gaussian process prior
Definition 4. A Gaussian process is a collection of random variables, any finite
number of which have a joint Gaussian distribution.
Assume that the observation errors are also Gaussian: $Y_i \sim N(f(x_i), \sigma^2)$, or, in
matrix form,
$$Y \sim N_n(\mathbf{f}, \sigma^2 I_n),$$
where $Y = (Y_1, \ldots, Y_n)^T$ and $\mathbf{f} = (f(x_1), \ldots, f(x_n))^T$.
Often, in regression problems, a priori $Ef(x) = m(x) = 0$.
Prior: $f \sim GP(0, k(x, y))$.
Posterior distribution
The posterior distribution of $f$ at an arbitrary set of points
$x^* = (x^*_1, \ldots, x^*_m)^T \in (0, 1)^m$, with $f^* = (f(x^*_1), \ldots, f(x^*_m))^T$, is then
$$f^* \mid Y, x, x^* \sim N_m(\mu^*, \Sigma^*),$$
where
$$\mu^* = k(x^*, x)[k(x, x) + \sigma^2 I_n]^{-1} Y,$$
$$\Sigma^* = k(x^*, x^*) - k(x^*, x)[k(x, x) + \sigma^2 I_n]^{-1} k(x, x^*).$$
If the posterior mean is used as a point estimator, we have, for any $x \in (0, 1)$:
$$\hat{f}(x) = E(f(x) \mid Y, x) = \sum_{i=1}^{n} \alpha_i k(x_i, x), \quad \text{where } \alpha = [k(x, x) + \sigma^2 I_n]^{-1} Y.$$
This estimator is linear in $Y$ and is a particular case of a kernel estimator.
In addition, we have posterior credible bands.
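A compact numpy sketch of these posterior formulas with a squared exponential covariance; the kernel choice, its length-scale and the simulated data are illustrative assumptions (see Rasmussen & Williams, 2006, for numerically stabler Cholesky-based versions).

```python
import numpy as np

def sq_exp_kernel(x, y, length_scale=0.1):
    """Squared exponential covariance k(x, y) = exp(-(x - y)^2 / (2 l^2))."""
    x, y = np.asarray(x), np.asarray(y)
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * length_scale**2))

def gp_posterior(x_star, x, Y, sigma2, kernel=sq_exp_kernel):
    """Posterior mean and covariance of f(x_star) given Y = f(x) + noise of variance sigma2."""
    K = kernel(x, x) + sigma2 * np.eye(len(x))
    K_star = kernel(x_star, x)                       # k(x*, x)
    mean = K_star @ np.linalg.solve(K, Y)
    cov = kernel(x_star, x_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

# usage (illustrative)
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 50))
Y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=50)
mean, cov = gp_posterior(np.linspace(0, 1, 101), x, Y, sigma2=0.04)
band = 1.96 * np.sqrt(np.diag(cov))   # pointwise 95% credible band around the mean
```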
Bayesian nonparametric estimators with a Gaussian process prior
Smoothness
If we assume $f \sim GP(0, k(x, y))$, then $f \in \mathcal{H}_k$, the Reproducing Kernel Hilbert
Space (RKHS) with kernel $k(x, y)$.
Hence the a priori regularity of a GP $f$ is the regularity of the corresponding RKHS $\mathcal{H}_k$.
Orthogonal basis estimators with basis $\{\varphi_i(x)\}$ are also (implicitly) assumed to
belong to a RKHS, with reproducing kernel $k(x, y) = \sum_{i=1}^{\infty} \varphi_i(x)\varphi_i(y)$.
Connection to splines:
if $k(x, y)$ is such that $||f||_{\mathcal{H}}^2 = \int [f''(x)]^2\,dx$, the corresponding MAP estimator is a cubic
spline.
The corresponding kernel is $k(x, y) = \frac{1}{2}|x - y|\,[\min(x, y)]^2 + \frac{1}{3}[\min(x, y)]^3$.
Regularity of Gaussian processes
1. Brownian motion: $k(x, y) = \frac{1}{2}[x + y - |x - y|]$.
$W(t) \in C[0, 1]$, $||W||_{\mathcal{H}}^2 = W(0)^2 + ||W'||_2^2$.
2. Fractional Brownian motion: $k(x, y) = \frac{1}{2}[x^{2\alpha} + y^{2\alpha} - |x - y|^{2\alpha}]$,
$\alpha \in (0, 1)$; $\alpha$-smooth.
References:
Q. Wu, F. Liang, S. Mukherjee and R.L. Wolpert (2007) Characterizing the function
space for Bayesian kernel models. Journal of Machine Learning Research.
A. van der Vaart and H. van Zanten (2008) Rates of contraction of posterior
distributions based on Gaussian process priors. Annals of Statistics, 36.
Next lecture
Frequentist behaviour of nonparametric estimators:
Consistency of (point) estimators $\hat{f}_n$.
Concentration of posterior measures.