
Radial Basis Function Networks

CSCI 5521: Paul Schrater

Introduction
In this lecture we will look at RBFs: networks where the activation of the hidden units is based on the distance between the input vector and a prototype vector

Radial Basis Functions have a number of interesting properties


Strong connections to other disciplines
function approximation, regularization theory, density estimation and interpolation in the presence of noise [Bishop, 1995]

RBFs allow for a straightforward interpretation of the internal representation produced by the hidden layer
RBFs have training algorithms that are significantly faster than those for MLPs
RBFs can be implemented as support vector machines

Radial Basis Functions


Discriminant function flexibility
Non-linear, but with sets of linear parameters at each layer
Provably a general function approximator given sufficient hidden nodes

Error function
Mixed: different error criteria are typically used for the hidden and output layers
Hidden layer: input approximation; Output layer: training error

Optimization
Simple least squares; with hybrid training the solution is unique

Two layer Non-linear NN

Non-linear functions are radially symmetric kernels with free parameters

Input to Hidden Mapping


Each region of feature space is represented by means of a radially symmetric function
Activation of a hidden unit is determined by the DISTANCE between the input vector x and a prototype vector

Choice of radial basis function


Although several forms of radial basis may be used, Gaussian kernels are most commonly used
The Gaussian kernel may have a full-covariance structure, which requires D(D+3)/2 parameters to be learned (D for the center plus D(D+1)/2 for the symmetric covariance matrix)

or a simpler structure with a single shared width (an isotropic kernel), with only D+1 independent parameters

In practice, a trade-off exists between using a small number of basis functions with many parameters and a larger number of less flexible functions [Bishop, 1995]
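As a quick sanity check on these parameter counts (my arithmetic, not from the slides), take D = 2:

$$
\underbrace{D}_{\text{center}} + \underbrace{\tfrac{D(D+1)}{2}}_{\text{symmetric covariance}} \;=\; \frac{D(D+3)}{2} \;=\; 5
\qquad\text{versus}\qquad
\underbrace{D}_{\text{center}} + \underbrace{1}_{\text{single width}} \;=\; D+1 \;=\; 3 .
$$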

Builds up a function out of Spheres


RBF vs. NN


RBFs have their origins in techniques for performing exact interpolation


These techniques place a basis function at each of the training examples and compute the coefficients w_i so that the mixture model has zero error at the examples


Formally: Exact Interpolation


Goal: find a function h(x) such that h(x_i) = t_i for every input x_i and target t_i

Form:
$$ y = h(x) = \sum_{i=1}^{n} w_i\,\phi(\lVert x - x_i \rVert) $$

Requiring h(x_i) = t_i at every training point gives the linear system
$$
\begin{pmatrix}
\phi_{11} & \phi_{12} & \cdots & \phi_{1n}\\
\phi_{21} & \phi_{22} & \cdots & \phi_{2n}\\
\vdots & & & \vdots\\
\phi_{n1} & \phi_{n2} & \cdots & \phi_{nn}
\end{pmatrix}
\begin{pmatrix} w_1\\ w_2\\ \vdots\\ w_n \end{pmatrix}
=
\begin{pmatrix} t_1\\ t_2\\ \vdots\\ t_n \end{pmatrix},
\qquad \phi_{ij} = \phi(\lVert x_i - x_j \rVert)
$$

Solve for w:
$$ \Phi\,\vec{w} = \vec{t} \quad\Rightarrow\quad \vec{w} = (\Phi^{T}\Phi)^{-1}\Phi^{T}\,\vec{t} $$

Recall the Kernel trick

Solving conditions

Micchelli's Theorem: if the points x_i are distinct, then the matrix Φ will be nonsingular.

Mhaskar and Micchelli, Approximation by superposition of sigmoidal and radial basis functions, Advances in Applied Mathematics,13, 350-373, 1992.
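A minimal numerical sketch of this exact-interpolation construction with Gaussian basis functions (the kernel width `sigma` and the toy 1-D data are illustrative assumptions, not part of the slides):

```python
import numpy as np

def exact_interpolation_weights(X, t, sigma=0.1):
    """Place one Gaussian basis function on every training point and
    solve Phi w = t so that the fit passes exactly through every target."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))                       # Phi_ij = phi(||x_i - x_j||)
    return np.linalg.solve(Phi, t)                               # nonsingular by Micchelli's theorem

def interpolate(x_new, X, w, sigma=0.1):
    """Evaluate h(x) = sum_i w_i phi(||x - x_i||) at new points."""
    d2 = np.sum((x_new[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ w

# Toy example: the interpolant reproduces the targets exactly at the training inputs
X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
t = np.sin(2.0 * np.pi * X).ravel()
w = exact_interpolation_weights(X, t)
print(np.allclose(interpolate(X, X, w), t))   # True (up to floating-point error)
```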


Radial Basis Function Networks


n outputs, m basis functions:
$$ y_k = h_k(x) \;=\; \sum_{i=1}^{m} w_{ki}\,\phi_i(x) + w_{k0} \;=\; \sum_{i=0}^{m} w_{ki}\,\phi_i(x), \qquad \phi_0(x) = 1 $$

e.g. Gaussian basis functions centered on prototypes μ_j:
$$ \phi_j(x) = \exp\!\left(-\frac{\lVert x - \mu_j \rVert^2}{2\sigma_j^2}\right) $$

Rewrite this in vector form:
$$ \vec{y} = W\,\vec{\phi}(x), \qquad \vec{\phi}(x) = \begin{pmatrix} \phi_0(x) \\ \vdots \\ \phi_m(x) \end{pmatrix} $$
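A short sketch of this forward pass, y = WΦ(x), with isotropic Gaussian basis functions and an explicit bias basis φ_0(x) = 1 (the centers, shared width, and array shapes below are illustrative assumptions):

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Design matrix with rows [phi_0(x), phi_1(x), ..., phi_m(x)], phi_0 = 1 (bias)."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def rbf_forward(X, centers, sigma, W):
    """y_k(x) = sum_{i=0..m} w_ki phi_i(x), evaluated for every row of X."""
    return rbf_design(X, centers, sigma) @ W.T        # shape (n_samples, n_outputs)

# Example: 5 inputs in 2-D, m = 3 basis functions, n = 2 outputs
X = np.random.randn(5, 2)
centers = np.random.randn(3, 2)
W = np.random.randn(2, 4)                             # n x (m+1); first column is the bias w_k0
print(rbf_forward(X, centers, 1.0, W).shape)          # (5, 2)
```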


Learning
RBFs are commonly trained following a hybrid procedure that operates in two stages
Unsupervised selection of RBF centers
RBF centers are selected so as to match the distribution of training examples in the input feature space.
This is the critical step in training, normally performed in a slow, iterative manner.
A number of strategies are used to solve this problem.

Supervised computation of output vectors


Hidden-to-output weight vectors are determined so as to minimize the sum-squared error between the RBF outputs and the desired targets.
Since the outputs depend linearly on these weights, the optimal weights can be computed with fast, linear least-squares methods.


Least squares solution for output weights


Given L input vectors x_l with target vectors t_l, form the least-squares error function:
$$ E = \sum_{l=1}^{L}\sum_{k=1}^{n}\bigl\{ y_k(x_l) - t_{lk} \bigr\}^2, \qquad t_{lk} = \text{the } k\text{th component of the target vector } \vec{t}_l $$

$$ E = \sum_{l=1}^{L} \bigl(\vec{y}(x_l) - \vec{t}_l\bigr)^{T}\bigl(\vec{y}(x_l) - \vec{t}_l\bigr) = \sum_{l=1}^{L} \bigl(W\vec{\phi}(x_l) - \vec{t}_l\bigr)^{T}\bigl(W\vec{\phi}(x_l) - \vec{t}_l\bigr) $$

Collecting the basis activations into the design matrix Φ (with Φ_{li} = φ_i(x_l)) and the targets into T (with T_{lk} = t_{lk}):
$$ E = \operatorname{Tr}\!\bigl[(\Phi W^{T} - T)^{T}(\Phi W^{T} - T)\bigr] = \operatorname{Tr}\!\bigl[W\Phi^{T}\Phi W^{T} - 2\,W\Phi^{T}T + T^{T}T\bigr] $$

$$ \frac{\partial E}{\partial W^{T}} = 2\,\Phi^{T}\Phi\,W^{T} - 2\,\Phi^{T}T = 0 \quad\Rightarrow\quad W^{T} = (\Phi^{T}\Phi)^{-1}\Phi^{T}T $$

Solving for input parameters


Now we have a linear solution, provided we know the input parameters. How do we set the input parameters?
Gradient descent, as in back-propagation (a minimal sketch follows below)
Density estimation viewpoint: by looking at the meaning of the RBF network for classification, the input parameters can be estimated via density estimation and/or clustering techniques. We will skip these in lecture.
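A hedged sketch of the gradient-descent option for the centers only (to stay short it omits the bias term, uses one shared width, and defines E with a factor of 1/2; none of these choices come from the slides):

```python
import numpy as np

def center_gradients(X, T, centers, sigma, W):
    """dE/d(mu_i) for E = 0.5 * sum_l ||y(x_l) - t_l||^2, with y(x) = W @ phi(x)
    and phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma^2))."""
    diff = X[:, None, :] - centers[None, :, :]                       # (L, m, D)
    Phi = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))   # (L, m)
    err = Phi @ W.T - T                                              # (L, n) output errors
    back = err @ W                                                   # (L, m): sum_k e_lk * w_ki
    coeff = back * Phi / sigma ** 2                                  # chain rule through phi_i
    return np.einsum('lm,lmd->md', coeff, diff)                      # one gradient row per center
```

Training would then alternate `centers -= lr * center_gradients(...)` with re-solving the linear output weights.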


Unsupervised methods overview


Random selection of centers
The simplest approach is to randomly select a number of training examples as RBF centers
This method has the advantage of being very fast, but the network will likely require an excessive number of centers

Once the center positions have been selected, the spread parameters σ_j can be estimated, for instance, from the average distance between neighboring centers

Clustering
Alternatively, RBF centers may be obtained with a clustering procedure such as the k-means algorithm (a sketch follows this list).
The spread parameters can be computed as before, or from the sample covariance of the examples of each cluster.

Density estimation

The positions of the RBF centers may also be obtained by modeling the feature-space density with a Gaussian Mixture Model fit by the Expectation-Maximization algorithm.
The spread parameters for each center are then obtained automatically from the covariance matrices of the corresponding Gaussian components.
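A sketch of the clustering route: pick centers with a plain k-means pass and set one shared spread from the average nearest-neighbor distance between centers (both the bare-bones k-means and the nearest-neighbor heuristic are my own choices of detail):

```python
import numpy as np

def kmeans_centers(X, k, n_iter=50, seed=0):
    """Very plain k-means; the resulting cluster centers serve as RBF prototypes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = d2.argmin(axis=1)                        # assign each point to nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)  # move center to cluster mean
    return centers

def spread_from_neighbors(centers):
    """Shared width: average distance from each center to its nearest neighboring center."""
    d = np.sqrt(np.sum((centers[:, None, :] - centers[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

X = np.random.randn(200, 2)
centers = kmeans_centers(X, k=10)
sigma = spread_from_neighbors(centers)
```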


Probabilistic Interpretation of RBFs for Classification

Introduce a mixture model over the feature space: the class posterior then decomposes into a sum over mixture components.

Probabilistic RBF: conditional class probabilities

where

Basis function outputs: the posterior probability of the jth feature (mixture) component given the input
Weights: the class probability given the jth feature component
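A sketch of this decomposition in the usual mixture notation (the symbols p(j|x) and P(C_k|j) are assumed here, following the standard Bishop-style treatment rather than copied from the slides):

$$
P(C_k \mid x) \;=\; \sum_{j=1}^{m} P(C_k \mid j)\,p(j \mid x) \;=\; \sum_{j=1}^{m} w_{kj}\,\phi_j(x),
\qquad \phi_j(x) = p(j \mid x), \quad w_{kj} = P(C_k \mid j)
$$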


Regularized Spline Fits


Given the design matrix
$$ \Phi = \begin{pmatrix} \phi_1(X_1) & \cdots & \phi_N(X_1) \\ \vdots & & \vdots \\ \phi_1(X_M) & \cdots & \phi_N(X_M) \end{pmatrix}, \qquad y_{\text{pred}} = \Phi\,\vec{w} $$

Put a penalty on large curvature:
$$ \Omega_{ij} = \int \left(\frac{d^{2}\phi_i(x)}{dx^{2}}\right)\!\left(\frac{d^{2}\phi_j(x)}{dx^{2}}\right) dx $$

Minimize
$$ L = (\vec{y} - \Phi\vec{w})^{T}(\vec{y} - \Phi\vec{w}) + \lambda\,\vec{w}^{T}\Omega\,\vec{w} $$

Solution
$$ \vec{w} = (\Phi^{T}\Phi + \lambda\Omega)^{-1}\Phi^{T}\vec{y} $$
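A short sketch of this solution, w = (Φ^TΦ + λΩ)^{-1}Φ^T y; the curvature-penalty matrix Ω is taken as given here (building it requires the second derivatives of the chosen basis, which the sketch does not assume):

```python
import numpy as np

def regularized_weights(Phi, y, lam, Omega=None):
    """w = (Phi^T Phi + lam * Omega)^{-1} Phi^T y."""
    if Omega is None:
        Omega = np.eye(Phi.shape[1])                 # default: identity penalty (plain ridge)
    A = Phi.T @ Phi + lam * Omega
    return np.linalg.solve(A, Phi.T @ y)
```

The general Ω form is the generalized ridge regression named on the next slide; Ω = I recovers ordinary ridge.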

Generalized Ridge Regression


Equivalent Kernel
Linear regression solutions admit a kernel interpretation:
$$ \vec{w} = (\Phi^{T}\Phi + \lambda\Omega)^{-1}\Phi^{T}\vec{y} $$
$$ y_{\text{pred}}(x) = \vec{\phi}(x)^{T}\vec{w} = \vec{\phi}(x)^{T}(\Phi^{T}\Phi + \lambda\Omega)^{-1}\Phi^{T}\vec{y} $$

Let $S = (\Phi^{T}\Phi + \lambda\Omega)^{-1}\Phi^{T}$. Then
$$ y_{\text{pred}}(x) = \vec{\phi}(x)^{T} S\,\vec{y} = \sum_i \bigl[\vec{\phi}(x)^{T}S\bigr]_i\, y_i = \sum_i K(x, x_i)\, y_i $$
so each prediction is a weighted combination of the training targets, with weights given by the equivalent kernel K(x, x_i).

Eigenanalysis and DF
An eigenanalysis of the effective kernel gives the effective degrees of freedom (DF), i.e., the effective number of basis functions.
Evaluate K at all training points: $K_{ij} = K(x_i, x_j)$. Eigenanalysis:
$$ K = V D V^{-1} = U S U^{T} $$
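A hedged sketch of one standard way to turn this into a number: form the hat matrix H = ΦS, which maps the observed y to the fitted values, and take the sum of its eigenvalues (its trace) as the effective DF. This is the usual linear-smoother convention; the slides may intend slightly different bookkeeping:

```python
import numpy as np

def effective_dof(Phi, lam, Omega):
    """Effective degrees of freedom: trace of H = Phi (Phi^T Phi + lam*Omega)^{-1} Phi^T."""
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * Omega, Phi.T)
    eigvals = np.linalg.eigvalsh(0.5 * (H + H.T))     # symmetrize before the eigenanalysis
    return eigvals.sum()                              # equals trace(H)
```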


Equivalent Basis Functions


Computing Error Bars on Predictions


It is important to be able to compute error bars on linear regression predictions. We can do that by propagating the error in the measured values to the predicted values.
Simple least squares: given the Gram (design) matrix
$$ \Phi = \begin{pmatrix} \phi_1(X_1) & \cdots & \phi_N(X_1) \\ \vdots & & \vdots \\ \phi_1(X_M) & \cdots & \phi_N(X_M) \end{pmatrix}, \qquad \vec{w} = (\Phi^{T}\Phi)^{-1}\Phi^{T}\vec{y} $$

Predictions:
$$ y_{\text{pred}} = \Phi\vec{w} = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}\vec{y} = S\,\vec{y} $$

Error propagation:
$$ \operatorname{cov}[\,y_{\text{pred}}\,] = \operatorname{cov}[\,S\vec{y}\,] = S\,\operatorname{cov}[\vec{y}]\,S^{T} = S\,(\sigma^{2} I)\,S^{T} = \sigma^{2} S S^{T} $$
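A minimal sketch of this propagation, assuming independent measurement noise of variance `sigma2` (so cov[y] = σ²I); the square roots of the diagonal of σ²SS^T are the error bars on the fitted values:

```python
import numpy as np

def prediction_error_bars(Phi, sigma2):
    """Std. dev. of each fitted value for simple least squares, y_pred = S y."""
    S = Phi @ np.linalg.pinv(Phi)          # S = Phi (Phi^T Phi)^{-1} Phi^T (hat matrix)
    cov_pred = sigma2 * S @ S.T            # propagate cov[y] = sigma2 * I through S
    return np.sqrt(np.diag(cov_pred))
```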