
Prediction Networks

Prediction
Predict f(t) based on values of f(t-1), f(t-2), ...
Two NN models: feedforward and recurrent
A simple example (section 3.7.3)
Forecasting the gold price for a month based on its prices in previous months
Using a BP net with a single hidden layer
1 output node: forecasted price for month t
k input nodes (using the prices of the previous k months for prediction)
k hidden nodes
Training samples, for k = 2: {(x_{t-2}, x_{t-1}), x_t}
Raw data: gold prices for 100 consecutive months; 90 for training, 10 for cross-validation testing
One-lag forecasting: predict x_t based on x_{t-2} and x_{t-1}

Multilag forecasting: use predicted values for further forecasting
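A minimal sketch of this setup, assuming the trained net is abstracted as a hypothetical predict function and that the price series and window size k are given:

```python
import numpy as np

def make_samples(series, k=2):
    """Build training pairs ((x[t-k], ..., x[t-1]), x[t]) from a time series."""
    X = np.array([series[t - k:t] for t in range(k, len(series))])
    y = np.array([series[t] for t in range(k, len(series))])
    return X, y

def one_lag_forecast(series, k, predict):
    """Predict each x[t] from the k *observed* previous values."""
    return [predict(series[t - k:t]) for t in range(k, len(series))]

def multilag_forecast(seed, steps, k, predict):
    """Feed predictions back in as inputs to forecast several steps ahead."""
    window = list(seed[-k:])
    out = []
    for _ in range(steps):
        nxt = predict(window)      # `predict` stands in for the trained 2-2-1 net
        out.append(nxt)
        window = window[1:] + [nxt]
    return out
```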

Prediction Networks
Training:
Three attempts: k = 2, 4, 6
Learning rate = 0.3, momentum = 0.6
25,000 to 50,000 epochs
The 2-2-1 net gave good predictions
The two larger nets were over-trained
Results (MSE):
Network   Training   One-lag   Multilag
2-2-1     0.0034     0.0044    0.0045
4-4-1     0.0034     0.0098    0.0100
6-6-1     0.0028     0.0121    0.0176

Prediction Networks
Generic NN model for prediction
Preprocessor prepares training samples from time series data
Train predictor using samples (e.g., by BP learning)



Preprocessor
In the previous example:
Let k = d + 1 (using the previous d + 1 data points to predict)
x(t) = (x_0(t), x_1(t), ..., x_d(t)), where x_i(t) = x(t - i), i = 0, ..., d
More general:
each x_i(t) is computed from past values of x by a kernel function c_i; different kernels give different memory models (how previous data are remembered)
Examples: exponential trace memory; gamma memory (see p. 141)
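A small sketch of two possible preprocessors; the tapped delay line matches the previous example, while the exponential trace update x_i(t) = (1 - mu_i) x(t) + mu_i x_i(t-1) is one common form of that memory and is an assumption here:

```python
import numpy as np

def tapped_delay_line(x, d):
    """Previous example: x_i(t) = x(t - i), i = 0..d (defined for t >= d)."""
    return np.array([[x[t - i] for i in range(d + 1)] for t in range(d, len(x))])

def exponential_trace(x, mus):
    """Assumed exponential trace memory: x_i(t) = (1 - mu_i)*x(t) + mu_i*x_i(t-1),
    one trace per kernel, mu_i in [0, 1)."""
    mus = np.asarray(mus, dtype=float)
    traces = np.zeros(len(mus))
    out = []
    for xt in x:
        traces = (1 - mus) * xt + mus * traces
        out.append(traces.copy())
    return np.array(out)
```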
Prediction Networks
Recurrent NN architecture
Cycles in the net
Output nodes with connections to hidden/input nodes
Connections between nodes at the same layer
Node may connect to itself
Each node receives external input as well as input from other
nodes
Each node may be affected by the output of every other node
With a given external input vector, the net often converges to an equilibrium state after a number of iterations (the output of every node stops changing)
An alternative NN model for function approximation
Fewer nodes, more flexible/complicated connections
Learning is often more complicated
Prediction Networks
Approach I: unfolding to a
feedforward net
Each layer represents a time delay
of the network evolution
Weights in different layers are
identical

Cannot directly apply BP learning
(because weights in different
layers are constrained to be
identical)
How many layers to unfold to?
Hard to determine
(Figure: a fully connected net of 3 nodes and its equivalent FF net of k layers)
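A sketch of the unfolding idea: running a fully connected recurrent net for k steps is the same computation as a k-layer feedforward net whose layers all share one weight matrix (the tanh node function and the random 3-node net are assumptions for illustration):

```python
import numpy as np

def unfolded_forward(W, x_ext, y0, k):
    """Evolve a fully connected recurrent net for k steps.
    Each step is one 'layer' of the unfolded feedforward net;
    every layer uses the *same* weight matrix W."""
    y = y0
    for _ in range(k):                  # k unfolded layers
        y = np.tanh(W @ y + x_ext)      # identical weights in every layer
    return y

# Example: a fully connected net of 3 nodes unfolded for 5 layers
W = np.random.randn(3, 3) * 0.5
print(unfolded_forward(W, x_ext=np.ones(3), y0=np.zeros(3), k=5))
```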
Prediction Networks
Approach II: gradient descent
A more general approach
Error driven: for a given external input,
E(t) = Σ_k (d_k(t) - o_k(t))^2 = Σ_k e_k(t)^2
where k ranges over the output nodes (whose desired outputs are known)
Weight update:
Δw_{i,j}(t) = -η ∂E(t)/∂w_{i,j}
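A sketch of the error-driven update Δw_{i,j}(t) = -η ∂E(t)/∂w_{i,j}; the gradient is estimated here by finite differences purely for illustration (actual recurrent-net learning computes it analytically), and the tanh node function, settling loop, and choice of output nodes are assumptions:

```python
import numpy as np

def settle(W, x_ext, steps=50):
    """Run the recurrent net until (approximately) an equilibrium state."""
    y = np.zeros(len(x_ext))
    for _ in range(steps):
        y = np.tanh(W @ y + x_ext)
    return y

def error(W, x_ext, d, out_idx):
    """E = sum over output nodes k of (d_k - o_k)^2."""
    o = settle(W, x_ext)[out_idx]
    return np.sum((d - o) ** 2)

def gd_step(W, x_ext, d, out_idx, eta=0.05, eps=1e-5):
    """delta w_ij = -eta * dE/dw_ij, with the gradient approximated numerically."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp = W.copy(); Wp[i, j] += eps
            grad[i, j] = (error(Wp, x_ext, d, out_idx) - error(W, x_ext, d, out_idx)) / eps
    return W - eta * grad
```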
NN of Radial Basis Functions
Motivations: better performance than the sigmoid function in
Some classification problems
Function interpolation
Definition
A function is radially symmetric (or is an RBF) if its output depends only on the distance between the input vector and a vector stored with that function
Distance = ||u - μ_i||, where u is the input vector and μ_i is the vector associated with the i-th RBF
Output decreases with that distance: φ(u_1) > φ(u_2) whenever distances u_1 < u_2
NNs with RBF node functions are called RBF-nets
NN of Radial Basis Functions
Gaussian function is the most widely used RBF:
φ_g(u) = e^{-(u/c)^2}
a bell-shaped function centered at u = 0
Continuous and differentiable:
if φ_g(u) = e^{-(u/c)^2}, then φ_g'(u) = e^{-(u/c)^2} (-(u/c)^2)' = -(2u/c^2) φ_g(u)
Other RBFs
Inverse quadratic function: φ_q(u) = (c^2 + u^2)^β, for β < 0
Hyperspheric function: φ_s(u) = 1 if u ≤ c, 0 if u > c
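The three node functions above, written out as a sketch (parameter names c and beta follow the formulas; the defaults are arbitrary):

```python
import numpy as np

def gaussian_rbf(u, c=1.0):
    """phi_g(u) = exp(-(u/c)^2): bell-shaped, centered at u = 0, differentiable."""
    return np.exp(-(u / c) ** 2)

def gaussian_rbf_deriv(u, c=1.0):
    """phi_g'(u) = -(2u/c^2) * phi_g(u)."""
    return -(2 * u / c ** 2) * gaussian_rbf(u, c)

def inverse_quadratic_rbf(u, c=1.0, beta=-1.0):
    """phi_q(u) = (c^2 + u^2)^beta, beta < 0."""
    return (c ** 2 + u ** 2) ** beta

def hyperspheric_rbf(u, c=1.0):
    """phi_s(u) = 1 if u <= c, 0 otherwise."""
    return np.where(u <= c, 1.0, 0.0)
```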
NN of Radial Basis Functions
Pattern classification
4 or 5 sigmoid hidden nodes
are required for a good
classification
Only 1 RBF node is required
if the function can
approximate the circle
(Figure: sample points separable by a circular decision boundary)
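A toy sketch of the claim: a single Gaussian RBF node, thresholded at the value it takes on the circle's boundary, implements a circular decision region (the center, radius, and threshold here are assumptions for illustration):

```python
import numpy as np

def rbf_classify(x, center, radius):
    """One Gaussian RBF node: outputs 1 when the input lies within `radius`
    of `center` -- a circular boundary that would take several sigmoid
    hidden nodes to approximate."""
    phi = np.exp(-np.sum((x - center) ** 2) / radius ** 2)
    return 1 if phi >= np.exp(-1.0) else 0   # threshold at distance == radius

center = np.array([0.0, 0.0])
print(rbf_classify(np.array([0.3, 0.2]), center, radius=1.0))  # 1: inside
print(rbf_classify(np.array([2.0, 1.5]), center, radius=1.0))  # 0: outside
```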
NN of Radial Basis Functions
XOR problem
2-2-1 network
2 hidden nodes are RBF:
φ_1(x) = e^{-||x - t_1||^2}, t_1 = [1, 1]
φ_2(x) = e^{-||x - t_2||^2}, t_2 = [0, 0]
Output node can be step or sigmoid
When input x is applied
Hidden node j calculates the distance ||x - t_j||, then its output φ_j(x)
All weights to hidden nodes set to 1
Weights to output node trained by LMS
t_1 and t_2 can also be trained

x        φ_1(x)   φ_2(x)
(1,1)    1        0.1353
(0,1)    0.3678   0.3678
(0,0)    0.1353   1
(1,0)    0.3678   0.3678
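A short sketch that reproduces the hidden-node outputs in the table (the middle values round to 0.3679 rather than 0.3678) and notes why a single output node then suffices:

```python
import numpy as np

t1, t2 = np.array([1, 1]), np.array([0, 0])

def hidden(x):
    """phi_j(x) = exp(-||x - t_j||^2) for the two RBF hidden nodes."""
    return (np.exp(-np.sum((x - t1) ** 2)),
            np.exp(-np.sum((x - t2) ** 2)))

for x in [(1, 1), (0, 1), (0, 0), (1, 0)]:
    p1, p2 = hidden(np.array(x))
    print(x, round(p1, 4), round(p2, 4))
# In (phi_1, phi_2) space the points for inputs (1,1) and (0,0) are separated
# from those for (0,1) and (1,0) by a straight line, so one step/sigmoid
# output node trained by LMS can solve XOR.
```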
NN of Radial Basis Functions
Function interpolation
Suppose you know f(x_1) and f(x_2); to approximate f(x_0) (with x_1 < x_0 < x_2) by linear interpolation:
f(x_0) = f(x_1) + (f(x_2) - f(x_1)) (x_0 - x_1) / (x_2 - x_1)
Let D_1 = |x_0 - x_1| and D_2 = |x_2 - x_0| be the distances of x_0 from x_1 and x_2, then
f(x_0) = [f(x_1) D_1^{-1} + f(x_2) D_2^{-1}] / [D_1^{-1} + D_2^{-1}]
i.e., a sum of the known function values, weighted and normalized by distances
Generalized to interpolating from more than 2 known f values:
f(x_0) = [D_1^{-1} f(x_1) + D_2^{-1} f(x_2) + ... + D_P^{-1} f(x_P)] / [D_1^{-1} + D_2^{-1} + ... + D_P^{-1}]
where P is the number of neighbors of x_0
Only those f(x_i) with small distance to x_0 are useful
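A sketch of the distance-weighted, normalized interpolation formula above (the exact-match shortcut when x_0 coincides with a sample is an added convenience, not part of the slide):

```python
import numpy as np

def interpolate(x0, xs, fs):
    """f(x0) = sum_i D_i^{-1} f(x_i) / sum_i D_i^{-1},  D_i = |x0 - x_i|."""
    D = np.abs(x0 - np.asarray(xs, dtype=float))
    if np.any(D == 0):                     # x0 is one of the samples
        return fs[int(np.argmin(D))]
    w = 1.0 / D                            # closer samples get larger weights
    return float(np.dot(w, fs) / np.sum(w))

# Two known values: behaves like the two-point formula
print(interpolate(1.5, [1.0, 2.0], [10.0, 20.0]))   # 15.0
```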
NN of Radial Basis Functions
Example:
8 samples with known function values
f(x_0) can be interpolated using only its 4 nearest neighbors (x_2, x_3, x_4, x_5):
f(x_0) = [D_2^{-1} f(x_2) + D_3^{-1} f(x_3) + D_4^{-1} f(x_4) + D_5^{-1} f(x_5)] / [D_2^{-1} + D_3^{-1} + D_4^{-1} + D_5^{-1}]
Using RBF nodes to achieve the neighborhood effect
One hidden node per sample x_i, with a node function φ(D) that decays with the distance D from the input (e.g., φ(D) = D^{-1})
Network output for approximating f(x_0) is then proportional to Σ_i φ(D_i) f(x_i)
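A sketch of the "one hidden node per sample" idea; the Gaussian kernel and the normalization of the output are assumptions here, chosen so that distant samples contribute almost nothing:

```python
import numpy as np

def rbf_net_output(x0, samples, values, c=1.0):
    """One Gaussian hidden node per stored sample x_i; output is the
    kernel-weighted, normalized combination of the stored f(x_i)."""
    d2 = np.sum((np.asarray(samples, dtype=float) - x0) ** 2, axis=1)
    phi = np.exp(-d2 / c ** 2)            # distant samples contribute ~0
    return float(np.dot(phi, values) / np.sum(phi))

samples = [[0, 0], [1, 0], [0, 1], [1, 1]]
values  = [0.0, 1.0, 1.0, 0.0]
print(rbf_net_output(np.array([0.1, 0.05]), samples, values, c=0.5))  # close to 0
```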
Clustering samples
Too many hidden nodes when # of samples is large
Group similar samples together into N clusters, each with
The center: vector μ_i
Desired mean output: the mean of the desired outputs of the samples in cluster i
Network output: the distance-weighted, normalized combination of the cluster values (as above, with the μ_i in place of the individual samples)
Suppose we know how to determine N and how to cluster all P samples (not an easy task itself); the centers and weights can then be determined by learning
NN of Radial Basis Functions
Learning in RBF net
Objective: learn the parameters (the centers μ_i and the output weights w_i) to minimize the total squared output error E over the training samples
Gradient descent approach: update both the weights and the centers along -∂E/∂w_i and -∂E/∂μ_i
One can also obtain the centers μ_i by other clustering techniques, then use GD learning for the output weights w_i only
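A sketch of the two-stage scheme just described: pick the centers with a clustering technique (plain k-means is assumed here as one option), then learn the output weights by gradient descent / LMS with the centers fixed; the Gaussian width c and the learning rate are arbitrary:

```python
import numpy as np

def kmeans(X, N, iters=20):
    """Choose N cluster centers mu_i (simple k-means as one possible technique)."""
    mu = X[np.random.choice(len(X), N, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        for i in range(N):
            if np.any(labels == i):
                mu[i] = X[labels == i].mean(axis=0)
    return mu

def train_output_weights(X, d, mu, c=1.0, eta=0.1, epochs=200):
    """With centers fixed, learn the output weights w by gradient descent (LMS)."""
    phi = np.exp(-((X[:, None, :] - mu[None]) ** 2).sum(-1) / c ** 2)  # P x N
    w = np.zeros(len(mu))
    for _ in range(epochs):
        err = d - phi @ w                  # e_p = d_p - o_p
        w += eta * phi.T @ err / len(X)    # delta w ~ eta * sum_p e_p * phi_p
    return w
```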
Polynomial Networks
Polynomial networks
Node functions allow direct computing of polynomials
of inputs
Approximating higher order functions with fewer nodes
(even without hidden nodes)
Each node has more connection weights
Higher-order networks
Node function is a weighted sum of products of inputs (repeated inputs, i.e., higher powers, are allowed):
net = w_0 + Σ_i w_i x_i + Σ_{i≤j} w_{i,j} x_i x_j + ... (terms up to order k)
# of weights per node: C(n,1) + C(n+1,2) + ... + C(n+k-1,k) (plus the bias w_0)
Can be trained by LMS
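A sketch of a k-th order node: enumerate the monomials of the inputs up to order k (repetition allowed, so higher powers occur), take a weighted sum, and train by LMS since the output is linear in the weights; the printed term count matches the binomial-sum formula above:

```python
import numpy as np
from itertools import combinations_with_replacement

def monomials(x, k):
    """All products of up to k inputs, repetition allowed (so x_i^2 etc. occur);
    the length of this vector is the node's weight count (including the bias)."""
    feats = [1.0]                                   # bias term
    for order in range(1, k + 1):
        for idx in combinations_with_replacement(range(len(x)), order):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

def higher_order_node(x, w, k):
    """Node output: weighted sum of the monomials -- linear in w, so LMS applies."""
    return float(w @ monomials(x, k))

print(len(monomials(np.zeros(3), 2)) - 1)   # n=3, k=2 -> C(3,1) + C(4,2) = 9 terms
```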
Polynomial Networks
Sigma-pi networks
Node function is a weighted sum of products of distinct inputs: net = w_0 + Σ_j w_j Π_{i ∈ S_j} x_i
Does not allow terms with higher powers of inputs, so they are not general function approximators
# of weights per node: C(n,1) + C(n,2) + ... + C(n,k) (plus the bias w_0)
Can be trained by LMS
Pi-sigma networks
One hidden layer with Sigma function (hidden nodes compute weighted sums of the inputs)
Output nodes with Pi function (each output is the product of its hidden-node inputs)
Product units:
Node computes a product: x_1^{P_{j,1}} x_2^{P_{j,2}} ... x_n^{P_{j,n}}
Integer powers P_{j,i} can be learned
Often mixed with other units (e.g., sigmoid)
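A sketch contrasting a product unit with a sigma-pi term; the power vector and the term list are illustrative assumptions:

```python
import numpy as np

def product_unit(x, p):
    """Product unit: output = prod_i x_i ** P_{j,i}; the integer powers p
    are trainable parameters (fixed here for illustration)."""
    return float(np.prod(np.asarray(x, dtype=float) ** np.asarray(p)))

def sigma_pi_node(x, terms):
    """Sigma-pi node: weighted sum of products of *distinct* inputs,
    e.g. terms = [(w, (0, 2)), ...] meaning w * x_0 * x_2 (no higher powers)."""
    return sum(w * np.prod([x[i] for i in idx]) for w, idx in terms)

x = [2.0, 3.0, 0.5]
print(product_unit(x, [2, 1, 0]))                      # 2^2 * 3^1 * 0.5^0 = 12.0
print(sigma_pi_node(x, [(1.0, (0, 1)), (0.5, (2,))]))  # 1*2*3 + 0.5*0.5 = 6.25
```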
