Training RBF Networks With Selective Backpropagation

Neurocomputing 62 (2004) 39 64
www.elsevier.com/locate/neucom
Training RBF networks with selective

backpropagation
Mohammad-Taghi Vakil-Baghmisheh , Nikola Pave*si+c
Laboratory of Articial Perceptions, Systems and Cybernetics, Faculty of Electrical Engineering,
University of Ljubljana, Slovenia
Received 11 March 2002; received in revised form 8 July 2003; accepted 19 November 2003
Abstract
Backpropagation with selective training (BST) is applied on training radial basis function
(RBF) networks. It improves the performance of the RBF network substantially, in terms of
convergence speed and recognition error. Three drawbacks of the basic backpropagation algorithm, i.e. overtraining, slow convergence at the end of training, and inability to learn the last
few percent of patterns are solved. In addition, it has the advantages of shortening training
time (up to 3 times) and de-emphasizing overtrained patterns. The simulation results obtained
on 16 datasets of the Farsi optical character recognition problem prove the advantages of the
BST algorithm. Three activity functions for output cells are examined, and the sigmoid activity
function is preferred over others, since it results in less sensitivity to learning parameters, faster
convergence and lower recognition error.
c 2003 Elsevier B.V. All rights reserved.

Keywords: Neural networks; Radial basis functions; Backpropagation with selective training; Overtraining;
Farsi optical character recognition
1. Introduction
Neural networks (NNs) have been used in a broad range of applications, including:
pattern classi:cation, pattern completion, function approximation, optimization, prediction, and automatic control. In many cases, they even outperform their classical
Corresponding author. LUKS, Fakulteta za elektrotehniko, Tr*
za*ska 25, 1000-Ljubljana, Slovenia.
Tel.: +386-1-4768839; fax: +386-1-4768316.
E-mail addresses: vakil@luks.fe.uni-1j.si (M.-T. Vakil-Baghmisheh), nikola.pavesic@fe.uni-1j.si
(N. Pave*si+c).
c 2003 Elsevier B.V. All rights reserved.

0925-2312/$ - see front matter
doi:10.1016/j.neucom.2003.11.011
40
M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64
y1
x1
z1
z2
x2
xl1
zl 3
yl 2
i = 1,..., l1
m = 1,..., l2
j = 1,..., l3
Fig. 1. Con:guration of the RBF Network (for explanation see Appendix A).
counterparts. In spite of diEerent structures and training paradigms, all NNs perform essentially the same function: vector mapping. Likewise, all NN applications
are special cases of vector mapping. Development of detailed mathematical models for
NNs began in 1943 with the work of McCulloch et al. [12] and was continued by
others.
According to Wasserman [20], the :rst publication on radial basis functions for
classi:cation purposes dates back to 1964 and is attributed to Bashkirof et al. [4] and
Aizerman et al. [1]. In 1988, based on Covers theorem on the separability of patterns
[6], Broomhead et al. [5] employed radial basis functions in the design of NNs.
The RBF network is a two layered network (Fig. 1), and the common method for
its training is the backpropagation algorithm. The :rst version of the backpropagation
algorithm, based on the gradient descent method, was proposed by Werbos [21] and
Parker [13] independently, but gained popularity after publication of the seminal book
by Rumelhart et al. [15]. Since then, many modi:cations have been oEered by others,
and Jondarr [10] has reviewed 65 varieties.
Almost all variants of backpropagation algorithm were originally devised for the
multilayer perceptron (MLP). Therefore, any variant of the backpropagation algorithm
which is used for training the radial basis function (RBF) network should be customized
to suit this network, so it will be somehow diEerent from the variant suitable for the
MLP. Using the backpropagation algorithm for training RBF network has three main
drawbacks:
overtraining, which weakens the networks generalization property,
slowness at the end of training,
inability to learn the last few percent of vector associations.
A solution oEered for overtraining problem is early stopping by employing cross
validation technique [9]. There are plenty of research reports that argue against usefulness of the cross validation technique in the design and the training of NNs. For
detailed discussions the reader is invited to see [2,3,14].
41
From our point of view, there are two major reasons against using early stopping
and the cross validation technique on our data:
(1) The cross validation stops training on both learned and unlearned data. While
the logic behind early stopping is preventing overtraining on learned data, there
is no logic for stopping the training on unlearned data, when the data is not
contradictory.
(2) In the RBF and the MLP networks, learning trajectory depends on the randomly
selected initial point. This means that the optimal number of training epochs which
is obtained by CV, is useful iE we start training always from the same initial point,
and the network always traverses the same learning trajectory!
To improve the performance of the network, the authors suggest the selective training,
as there is no other way to improve the performance of the RBF network on the given
datasets. The paper shows that if we use early stopping or continue the training with
the whole dataset, the generalization error will be much more than the results obtained
by the selective training. In [19] the backpropagation with selective training (BST)
algorithm was presented for the :rst time and was used for training the MLP network.
Based on the results obtained on our OCR datasets, the BST algorithm has the
following advantages over the basic backpropagation algorithm:
prevents from overtraining,
de-emphasizes the overtrained patterns,
enables the network to learn the last percent of unlearned associations in a short
period of time.
As there is no universally eEective method, the BST algorithm is not an exception.
Since the contradictory data or the overlapping part of the data cannot be learned,
applying the selective training on data with a large overlapping area will destabilize
the system, but it is quite eEective when dataset is error-free and non-overlapping, as it
is the case with every error-free character-recognition database, when enough number
of proper features are extracted.
Organization of the paper: The RBF network is reviewed in Section 2. In Section
3, the training algorithms are presented. Simulation results are presented in Section 4,
and conclusions are given in Section 5. In addition, the paper includes two appendices.
In most of the resources the formulations for calculating error gradients of RBF
networks are either erroneous and conOicting (for instance see the formulas 4.57, 4.60,
7.53, 7.54, 7.55 in [11]), or having not been given at all (see for instance [20,16]).
Thus in Appendix A, we obtain these formulas for three forms of output cell activity function. Appendix B presents some information about feature extraction methods
used for creating the Farsi optical character recognition datasets, which are used for
simulations in this paper.
Remark. Considering that in the classi:er every pattern is represented by its feature
vector as the input vector to the classi:er, classifying the input vector is equivalent
42
to classifying the corresponding pattern. Frequently in the paper, the vector which is
to be classi:ed has been referred to by the input pattern, or simply pattern, and vice
versa.
2. RBF networks
In this section, the structure, training paradigms and initialization methods of RBF
networks are reviewed.
2.1. Structure
While there are various interpretations of RBF, in this paper we will consider
it from the pattern recognition point of view. The main idea is to divide the input space into subclasses, and to assign a prototype vector for every subclass in
the center of it. Then the membership of every input vector in each subclass will
be measured by a function of its distance from the prototype (or kernel vector),
that is fm (x) = f(x vm ). This membership function should have four speci:cations:
1.
2.
3.
4.
Attaining the maximum value in the center (zero distance).

Having considerable value in the close neighborhood of center.
Having negligible value in far distances (where are close to other centers).
DiEerentiability.
In fact, any diEerentiable and monotonically decreasing function of x vm will ful:ll these conditions, but the Gaussian function is the common choice. After obtaining
the membership values (or similarity measures) of input vector in the subclasses, the
results should be combined to obtain the membership degrees in every class. The two
layered feed-forward neural network depicted in Fig. 1 is capable of performing all the
operations, and is called the RBF network.
The neurons in the hidden layer of network have a Gaussian activity function and
their inputoutput relationship is:
x vm 2
ym = fm (x) = exp
2
2m

;
(1)
where vm is the prototype vector or the center of the mth subclass and m is the
spread parameter, through which we can control the receptive :eld of that neuron.
The receptive :eld of the mth neuron is the region in the input space where fm (x)
is high.
The neurons in the output layer could be sigmoid, linear, or pseudo-linear, i.e. linear
with some squashing property, so the output could be calculated using one of the
following equations:
1
+
eSj
s
j
;
zj =
l
sj
;

m ym
43
sigmoid;
1
squashing function;
l2
1
pseudo-linear; with
squashing function;
m ym
linear; with
(2)
where
sj =
l2

ym umj ;
j = 1; : : : ; l3 :
(3)
m=1
Although in the most of literature, the neurons with linear or pseudo-linear activity
function have been considered for the output layer, we strongly recommend using the
sigmoidal activity function, since it results in less sensitivity to learning parameters,
faster convergence and lower recognition error.
2.2. Training paradigms
Before starting the training, a cost function should be de:ned, and through the training process we will try to minimize it. Total sum-squared error (TSSE) is the most
popular cost function.
Three paradigms of training have been suggested in the literature:
1. No-training: In this the simplest case, all the parameters are calculated and :xed
in advance and no training is required. This paradigm does not have any practical
value, because the number of prototype vectors should be equal to the number of
training samples, and consequently the network will be too large and very slow.
2. Half-training: In this case the hidden layer parameters (kernel vectors and spread
parameters) are calculated and :xed in advance, and only the connection weights
of output layer are adjusted through backpropagation algorithm.
3. Full-training: This paradigm requires the training of all parameters including kernel
vectors, spread parameters, and the connection weights of output layer (vm s, m s
and umj s) through the backpropagation algorithm.
2.3. Initialization methods
The method of initialization of any parameter will depend on the selected training
paradigm. To determine the initial values of kernel vectors, many methods have been
suggested, among them the most popular are:
1. the :rst samples of the training set,
2. some randomly chosen samples from the training set,
44
3. subclass centers obtained by some clustering or classi:cation method, e.g. k-means

algorithm or LVQ algorithm.
Theodoridis [16] has reviewed some other methods and cited some related review
papers.
Wasserman [20] presents a heuristic which can be useful in determining the method
of calculating initial values of spread parameters:
Heuristic: The object is to cover the input space with receptive :elds as uniformly
as possible. If the spacing between centers are not uniform, it may be necessary for
each subclass to have its own value of . For subclasses that are widely separated
from others, must be large enough to cover the gap, whereas for subclasses that
are close to others, must have a small value.
Depending on the dataset, training paradigm, and according to the heuristic, one of the
following methods can be adopted:
1. Assigning a small :xed value, say, = 0:05 or 0.1, which requires a large number
of hidden
neurons to cover the input space.
2. = d= 2l2 , where d is the maximum distance between the chosen centers, and l2
is the number of centers.
3. In the case of using the k-means algorithm to :nd the kernel vectors, m could be
the standard deviation of the vectors in the pertaining subclass.
To assign initial values to the weights in the output layer, there are two methods:
1. Some random values in the range [ 0:1; +0:1]. This method necessitates weight
adjustment through an iterative process (the backpropagation algorithm).
2. Using the pseudo-inverse matrix to solve the following matrix equation:
YU = Z;
y1

Y = ... ;

yQ
z1

Z = ... ;

zQ
yq R l 2 ;
zq Rl3 ;
(4)
where y1 ; : : : ; yQ and z1 ; : : : ; zQ are the row vectors obtained from the hidden and output
layers, respectively, in response to the x1 ; : : : ; xQ row vectors in the input layer, and
the equation YU = Z is made as follows: for each input vector in the training set xq ,
the outputs from the hidden layer are made a row in the matrix Y, target outputs are
placed in corresponding rows of target matrix Z and each set of weights associated
with an output neuron is made a column of matrix U.
Considering that in large scale problems, the dimension of Y is high and (YT Y)1
is ill-conditioned, despite super:cial appeal of the pseudo-inverse matrix method, the
:rst iterative method is the only applicable one.
45
3. Training algorithms
In this section, we will present two training algorithms for the RBF network. First
the basic backpropagation (BB) algorithm is reviewed, and then the modi:ed algorithm
is presented.
3.1. Basic backpropagation for the RBF network
Here we will consider the algorithm for the full-training paradigm; customizing it
for half-training is straightforward and can be done simply by eliminating gradient
calculations and weight-updating corresponding to the appropriate parameters.
Algorithm.
(1) Initialize network.
(2) Forward pass: Insert the input and the desired output, compute the network outputs
by proceeding forward through the network, layer by layer.
(3) Backward pass: Calculate the error gradients versus the parameters, layer by layer,
starting from the output layer and proceeding backwards: @E=@umj , @E=@vim , @E=@m2
(see Appendix A).
(4) Update parameters:
umj (n + 1) = umj (n) 3
@E
;
@umj
(5)
vim (n + 1) = vim (n) 2
@E
;
@vim
(6)
m2 (n + 1) = m2 (n) 1
@E
;
@m2
(7)
where 1 ; 2 ; 3 are learning rate factors in the range [0; 1].

(5) Repeat the algorithm for all training inputs. If one epoch of training is :nished,
repeat the training for another epoch.
Remarks. (1) Based on our experience, the addition of the momentum termas it is
common for the MLPdoes not help in training of the RBF network.
(2) If the sigmoidal activity function is used for output cells, adding sigmoid prime
oEset 1 [8] will improve training substantially, similar to the MLP.
(3) Stopping should be decided based on the results of the network test, which
is carried out every T epochs after cost function becomes smaller than a threshold
value C.
1 As the output of the neurons approaches extreme values (0 or 1) there will be just a little learning or
no learning. A solution to this problem is adding a small oEset ( 0:1) to the derivative @zj =@sj in Eq.
(A.9), which is called the sigmoid prime o=set, thus @zj =@sj never reaches zero. Based on our experience,
adding such a term is helpful only in calculation of (A.11), but not in calculation of (A.26) or (A.30).
46
(4) To get better generalization performance, using the cross validation [9] method
has been reported in some cases as a stopping criterion; this, howeveras was mentioned in the introductionis unsatisfactory and unconvincing, because it stops training
on both learned and unlearned inputs.
(5) The number of output cells depends on the number of classes and the approach
of coding, however, it is highly recommended to make it equal to the number of
classes.
(6) Sometimes, in the net input of the sigmoid function or the linear output a constant
term is also considered (called threshold term), which is implemented using a constant
input (equal to 1). In some cases this term triggers the moving target phenomenon
and hinders training, and in some other cases without it there is no solution. Therefore,
it must be examined for every case, separately.
(7) In the rest of this paper our purpose from BB is the backpropagation algorithm
with sigmoid prime oEset as explained in footnote 1, without the momentum term.
3.2. Backpropagation with selective training
The diEerence between the BST algorithm and the BB algorithm lies in the selective
training, which is appended to the BB algorithm. When most of the vector associations
have been learned, every input vector should be checked individually, and if it is
learned there should be no training on that input, otherwise training will be carried
out. In doing so, a side eEect will arise: the stability problem. That is to say, when
we continue training on only some inputs, the network usually forgets the other input
output associations which were already learned, and in the next epoch of training it will
make wrong predictions for some of the inputs that were already classi:ed correctly.
The solution to this side eEect consists of considering a stability margin for the
de:nition of the correct classi:cation in the training step. In this way we also carry
out training on marginally learned inputs, which are on the verge of being misclassi:ed.
Selective training has its own limitations, and cannot be used on conOicting data,
or on a dataset with large overlapping areas of classes. Based on the obtained results,
using the BST algorithm on an error free OCR dataset has the following advantages:
prevents from overtraining,
de-emphasizes the overtrained patterns,
enables the network to learn the last percent of unlearned associations in a short
period of time.
BST algorithm.
(1) Start training with BB algorithm, which includes two steps:
forward propagation,
backward propagation.
(2) When the network has learned most of the vector mappings and the training procedure has slowed down, i.e. TSSE becomes smaller than a threshold value C,
stop the main algorithm and continue with selective training.
47
(3) For any pattern perform forward propagation and examine the prediction of the
network:
zJ = max(zj )
j = 1; : : : ; l3 ;
if (zJ zj + )
j = J J is the predicted class;
if (zJ zj + )
j = J no-prediction;
(8)
where is a small positive constant, and is called the stability margin.

(4) If the network makes a correct prediction, do nothing, go back to step 3 and repeat
the algorithm for the next input, else, network does not make a correct prediction
(including no-prediction case), carry out the backward propagation.
(5) If the epoch is not complete, go to step 3, else check the total number of wrong
predictions: if its trend is decreasing, go to step 3 and continue the training otherwise stop training.
(6) Examine network performance on the test set:
do only forward propagation, and then:
for any input: zJ = max (zj ) j = 1; : : : ; l3 J is the predicted class.
J
Remarks. (1) An alternative condition for starting the selective training is as follows:
after TSSE becomes smaller than a threshold value C, every T epochs carry out a
recognition test on the training set, and if it ful:lls a second threshold condition, that
is (recognition error C1 ), start selective training. The recommended value for T is
3 6 T 6 5.
(2) If is chosen large, training will be continued for most of the learned inputs,
and this will make our method ineEective. On the other hand, if is chosen too small,
during training we will face a stability problem, i.e. with a small change in the weights
valueswhich happens in every epocha considerable number of associations will be
forgotten, thus the network will oscillate and training will not proceed. After training,
by a small change in the feature vector that causes a small change in output values, the
prediction of the network will change, or for feature vectors from the same class but
with minor diEerences, we will have diEerent predictions. This also causes vulnerability
to noise and weak performance on both the test set and on the real world data out of
both the training set and the test set. The optimum value of should be small enough
to prevent training on learned inputs, but not so small as to give way to changing the
winner neuron with minor changes in weights values or input values. Our simulation
results show that for the RBF network a value in the range [0:1; 0:2] is the optimum
for our datasets.
(3) It is also possible to consider a no-prediction state in the :nal test, that is
zJ = max (zj )
j = 1; : : : ; l3 ;
if (zJ zj + 1 )
j = J J is the predicted class;
if (zJ zj + 1 )
j = J no-prediction
(9)
48
in which 0 1 . This no-prediction state will decrease the error rate at the cost
of decreasing the correct prediction rate.
4. Experiments
In this section, :rst we give some explanation about datasets, on which simulations
have been carried out. Then simulation results are presented.
4.1. Datasets
A total of 18 datasets composed of feature vectors of 32 isolated characters of the
Farsi alphabet, sorted in three groups, were created through various feature extraction
methods, including: principal component analysis (PCA), vertical, horizontal and diagonal projections, zoning, pixel change coding and some combinations of them, with
the number of features varying from 4 to 78 per character according to Table 1.
For creating these datasets, 34 Farsi fontswhich are used in publishing online
newspapers and web siteswere downloaded from the Internet. Fig. 2 demonstrates a
whole set of isolated characters of one sample font printed as text. Then, 32 isolated
characters of these 34 Farsi fonts were printed in an image :le; 11 sets of these
fonts were boldface and one set was italic. In the next step, by printing the image
:le and scanning it with various brightness and contrast levels, two additional image
:les were obtained. Then, using a 65 65 pixel size window, the character images
were separated into images of isolated characters. After applying a shift-radius invariant
Table 1
Brief information about Farsi OCR datasets
Database
Extraction method
Group A
db1
db2
db3
db4
db5
PCA
PCA
PCA
PCA
PCA
Group B
Group C
dbn1 to dbn5
db6
db7
db8
db9
db10
db11
db12
db13
PCA
Zoning
Pixel change coding
Projection
Projection
Projection
Explanation
No. of features
per character
72
54
72
64
48
Normalized versions of db1 to db5

Horizontal and vertical
Diagonal
db8+db9
db6+db7
db6+db8
db6+db7+parts of db8
4
48
48
30
78
52
52
72
49
Fig. 2. A sample set of machine-printed Farsi characters (isolated case).
image normalization [18], and by reducing the sizes of character images to 24 24

pixel, the features vectors were extracted as explained in Appendix B.
4.2. Simulation results
In all of the simulations, two-thirds of the feature vectors, obtained from the original
image and the :rst scanned image, were assigned for the training set, and one-third
of feature vectors obtained from the second scanned image were assigned for the test
set. Therefore, 68 samples per character were used for training and 34 samples per
character for testing. Thus, the total number of samples in the training sets and test
sets are 2176 and 1088, respectively.
We considered three types of activity functions for the output layer: linear,
pseudo-linear, and sigmoid. And we faced numerous problems with both linear and
pseudo-linear activity functions. These problems are explained later (in current section
see Considerations, item 2). Hence in the sequel we present only the simulation results
obtained by the sigmoidal output neurons.
In Table 2 the results obtained on datasets of groups AC, by the BST and the BB
algorithms have been compared against each other.
4.2.1. Settings and initializations
In all the cases, we considered one prototype pattern per class, i.e. 32 prototype
vectors, and 32 output cells for the output layer. Thus, the network con:guration is
l1 3232, where l1 is the dimension of the input vector which is equal to the number
of features per character (see Table 1). Adding the threshold term to the sigmoid
activity functions triggered the moving target phenomenon, so we eliminated it. Also
we did not add momentum term, because it did not help to improve training. However,
adding the sigmoid prime oEset boosted the performance of network substantially.
Two training paradigms were examined, i.e. half- and full-training. In Table 2, we
have presented only the results obtained by the full-training paradigm; the consequences
of adopting the half-training paradigm will be discussed at the end.
For initializing kernel vectors, two methods were adopted: the :rst samples, and the
cluster centers obtained by k-means algorithm. Considering that k was set equal to one,
these centers are simply the average of training samples of every class. For initializing
50
Table 2
Comparing the recognition errors of the RBF network obtained by the BB and the BST algorithms
BB
BST
Epoch
Error
Train
db1
db2
100
100
db3
db4
db5
db7
Parameters
Epochs
Error
Test
n; N
Train
11
22
25
32
100
100
100
100
11
9
14
63
7
6
6
31
db8
db10
100
100
83
20
40
10
db11
db12
100
100
49
68
28
32
db13
dbn1
100
100
4
12
2
18
dbn2
100
27
dbn3
100
dbn4
dbn5
100
100
3
17
1
9
64, 104
57, 90
100, 140
51, 90
39, 55
58, 95
65, 105
100, 140
77, 115
68, 108
100, 140
48, 65
77, 117
100, 140
49, 60
71, 113
100, 120
64, 100
100, 140
53, 93
100, 130
45, 75
65, 100
0
4
2
0
0
0
3
2
14
2
1
0
5
5
0
0
0
4
0
0
0
0
0
First phase
Second phase
Test
3 , 2 , 1
3 , 2 , 1
17
22
19
0
0
0
1
0
7
1
0
0
1
1
0
16
13
20
15
0
0
0
0
5; 0.04; 0.001
5; 0.04; 0.001
5; 0.04; 0.001
5; 0.04; 0.001
8; 0.04; 0.001
5; 0.04; 0.001
1.4; 0.001; 5e-6
1.4; 0.001; 5e-6
3; 0.001; 5e-6
6; 0.001; 5e-6
6; 0.001; 5e-6
3; 0.001; 5e-6
5; 0.001; 5e-6
5; 0.001; 5e-6
5; 0.001; 5e-6
5; 0.001; 5e-6
5; 0.001; 5e-6
4; 0.001; 5e-6
4; 0.001; 5e-6
5; 0.001; 5e-6
5; 0.001; 5e-6
4; 0.001; 5e-6
4; 0.001; 5e-6
1.7; 0.013; 0.0003

1.7; 0.013; 0.0003
1.7; 0.013; 0.0003
1.7; 0.013; 0.0003
2.7; 0.013; 0.0003
1.7; 0.013; 0.0003
0.5; 0.00033; 1.7e-6
.5; 0.00033; 1.7e-6
1; 0.00033; 1.7e-6
2; 0.0003; 1.7e-6
2; 0.0003; 1.7e-6
1; 0.0003; 1.7e-6
1.7; 0.0003; 1.7e-6
1.7; 0.0003; 1.7e-6
2; 0.0003; 1.7e-6
1.7; 0.0003; 1.7e-6
1.7; 0.0003; 1.7e-6
1.3; 0.0003; 1e-6
1.3; 0.0003; 1e-6
1.7; 0.0003; 1e-6
1.7; 0.0003; 1e-6
1.3; 0.0003; 1e-6
1.3; 0.0003; 1e-6
Threshold
2
60
110
80
60
80
220
220
120
170
200
80
50
100
0
60
70
80
6
5
5
6
7
6
0.05
0.05
0.2
0.2
0.2
0.05
0.15
0.15
0.15
0.5
0.5
0.4
0.4
0.5
0.5
0.6
0.6
Database
51
spread parameters two policies were adopted:

(1) A :xed number slightly larger than the minimum of standard deviations of all
clusters created by k-means algorithm, de:ned as in Eq. (10):
2 = min(i2 );
i
(10)
where i2 is the variance of the training patterns of the ith cluster, de:ned by

1
(xq mi )2 ; for i = 1; : : : ; 32
(11)
i2 =
68 q
x cluster i
and mi is the average vector of the ith cluster.

(2) DiEerent initial values equal to the variances of every cluster obtained by the
k-means algorithm.
Total sum-squared error (TSSE) was considered as the cost function, and random
numbers in the range [ 0:1; +0:1] were assigned to the initial weight matrix of the
output layer. The results presented in Table 2 were obtained while the initial spread
parameters were equal, and the initial kernel vectors were set equal to cluster centers
of the k-means algorithm. For the BST algorithm, the stability margin was set equal
to 0:2. This value was obtained based on the empirical results.
4.2.2. Training algorithms
(1) BB algorithm: The network was trained for 100 epochs with the BB algorithm.
The obtained results are given under column BB.
(2) BST algorithm: A threshold value for TSSE was chosen, after which the training
procedure slows down. This threshold value was acquired from the :rst training
experiment through the BB algorithm. Training was restarted by the BB algorithm
from the same initial point of the :rst experiment; when TSSE reached to the
threshold value, we changed the training algorithm from unselective to selective
and continued for a maximum of 40 epochs, with the values of learning parameters,
i.e. 3 2 and 1 , decreased almost three times. Every :ve epochs the network was
tested; if the recognition error on the training set was zero, training was stopped.
Also training was stopped if either dynamic recognition error 2 reached zero or
if 40 epochs of training were over. The obtained results are given under column
BST. In this column n and N represent the epoch numbers where selective training
starts and ends.
(3) BST algorithm: We did not set any threshold; training was carried out for 100
epochs on datasets dbn1, db2, dbn2, dbn3, db7, db10 and db12 with the BB
2 Dynamic recognition error is obtained while the network is under training, and after presenting any
pattern, the network parameters probably will change. Therefore, it is diEerent to the recognition error which
is obtained after training. Generally, in selective training, the dynamic recognition error on the training set
will be larger than the recognition error, therefore we could stop training earlier by performing a test on the
training set after every few epochs of training, but this will violate the stability condition. Although, in our
case study, this does not cause any problem, it can increase recognition error in real on-line operation.
52
algorithm, then followed with selective training for maximum of 40 epochs. The
obtained results are given under column BST, where the threshold value has not
been speci:ed, or where n is equal to 100.
4.2.3. Analysis
(1) Results obtained by the BB algorithm: The recognition errors on datasets dbn4,
db13, dbn3, db4, db3 and db5 are lower than those on others, and db12, db8 and db7
must be considered the worst ones. Data normalization has improved the recognition
rate in all cases, excluding dbn5. We notice that the performances on test sets of
db1 and db2 are weaker than that on training sets of these datasets, and this must
be attributed to the inappropriate implementation of the feature extraction method.
The learning rates of kernel vectors and spread parameters, i.e. 2 and 1 are much
smaller for datasets of groups B and C than those for the datasets of group A, but
the value of 3 (the learning rate for weight matrix of the output layer), does not
change substantially for datasets from diEerent groups. The initial spread parameters
for datasets of group A are much larger than for the datasets of groups B and C.
(2) The results obtained by the BST algorithm: The :rst eminent point is the
decreased recognition errors on all datasets. The BST algorithm has achieved much
better results in shorter time; especially on db3, db4, dn5, dbn3, dbn4, dbn5, db11,
and db13 the recognition error has reached zero. For evaluating and ranking these
datasets we have two other measures: convergence speed and the number of features.
Regarding convergence speed, the best datasets will be: db4, db13, db11, dbn4, db3,
dbn3, db5, dbn5, although in some cases the diEerences are too slight to be meaningful;
db3, dbn3 and db13 should be ruled out because of high dimension of their feature
vectors. In addition, training does not bene:t from data normalization. Also the BST
algorithm solves the overtraining problem. We notice that it has decreased the error
rate on the training set but not at the cost of increased error on the test set. And this
is even more obvious from the results obtained on db1 and db2. In addition, this can
be veri:ed from the results demonstrated in Table 3.
(3) In Table 3 we have compared the recognition errors at the epochs n and N , i.e.
at the beginning and the end of the selective training obtained by the BST algorithm,
against recognition error at the same epochs obtained by the BB algorithm, on four
datasets. It shows that after TSSE reaches to the threshold value, if we continue training
with basic backpropagation the recognition error either decreases very trivially or even
increases (e.g. on test set of db4 and db11). Some researchers use the cross validation
technique to :nd this point and stop training at it, but we are opposing applying the
cross validation method on neural networks training.
(4) Table 4 shows the results of another experiment performed on db4 with diEerent
network settings. Threshold value was set equal to 100, stability margin equal to 0.1,
and learning rate parameters divided only by two for the second phase of training, and
the initial weight matrix was changed. While by epoch number 22 the error rates on
both training set and test set are equal, i.e. 26 and 14, respectively, the BST algorithm
has reached zero error on both sets in 23 epochs of selective training, and the error rates
of the BB algorithm after 100 epochs of unselective training are 6 and 3, respectively,
on training set and test set.
53
Table 3
Comparing the recognition errors of the BB and the BST algorithms in two points of training
Database
BB
Epoch
BST
Error
Epoch
Train
Test
Error
Train
Test
db1
65
104
20
11
29
25
65
104
20
0
29
17
db4
40
100
9
9
4
6
40
55
9
0
4
0
db8
78
115
89
83
41
40
78
115
89
14
41
7
db11
48
100
50
49
26
28
48
65
50
0
26
0
Table 4
Comparing the recognition errors of the RBF network on db4 obtained by the BB and the BST algorithms
BB
BST
Epoch
Error
Train
22
100
26
6
Epoch
Error
Test
n; N
Train
Test
14
3
22, 45
100, 120
0
0
0
0
(5) The reader should recall that in the selective training mode, the calculation
for weight updating (or backpropagating)which is the most time-consuming step of
trainingis carried out only for misclassi:ed patterns, whose number at the beginning
of selective training is less than 89, or 5% of all training samples (see Tables 2 and 3);
thus one epoch of selective training is at least :ve times faster than that of unselective
training, and by decreasing misclassi:ed patterns through time it becomes faster and
faster. Therefore the BST algorithm is at least three times faster than the BB algorithm.
(6) Fig. 3 demonstrates TSSE versus the epoch number, obtained on db4, corresponding with the experiment of Table 4. By changing the training algorithm we face
a sudden descent in TSSE, and this must be attributed to the sharp decrease of learning
rate factors. Our explanation for this phenomenon is as follows:
After approaching the minimum well, by using large values of learning rate and
momentum factor we step over the well. But by decreasing the step size we put our
leg in the middle of the minimum well and face a fall to the bottom. This phenomenon
inspired us to devise the BDLRF [17] and BPLRF [19] algorithms. BDLRF and BPLRF
54
TSSE
1000
BST
100
BB
10
0
20
40
60
80
Epoch number
100
120
Fig. 3. Convergence diagrams of the RBF network obtained by the BB and the BST algorithms on db4,
corresponding to Table 4.
are acronyms for backpropagation with declining leaning rate factor and backpropagation with plummeting leaning rate factor, respectively. In [17,19] we have shown
how to speed up training and improve the recognition rate in MLP by decreasing the
learning rate factor. Also we have shown that by larger decrease in the values of
training factors can result in larger decrease in cost function, and better recognition
rate.
(7) In addition, we notice that by training in the second phase while the recognition
error decreases, TSSE increases, which substantiates our statement that our method
does not overtrain the network on learned patterns. On the contrary, if the network has
been overtrained on some patterns in the :rst phase, by increasing TSSE and decreasing
recognition error it is de-emphasizing on already overtrained patterns. In other words,
decreased recognition error on unlearned patterns must be a resultant of decreased SSE
(sum-squared error) resulting from the same patterns. Thus, for an increase of TSSE,
simultaneous with decrease of recognition error, there has to be an increase in the SSE
resulting from already learned patterns without crossing the stability margin, and this
means de-emphasizing on overtrained patterns. Therefore, our method decreases the
error on the training set, but not at the cost of overtraining and increased error on the
test set.
4.2.4. Considerations
(1) By starting from a diEerent initial point, the number of training epochs will
change slightly, but not so much as to aEect the general conclusions.
(2) As already mentioned, we considered three types of activity functions for
the output layer: sigmoid, linear, and pseudo-linear. And we faced numerous problems
55
with both linear and pseudo-linear activity functions, as explained in the

following:
Slow learning: they do not allow for using large learning rate factors. In the case
of using a large learning rate, the network will oscillate and will not converge to
minimum.
High sensitivity to parameters values and wide range in which the optimal parameters
values lie for diEerent datasets. The optimal values change enormously for diEerent
datasets (up to 4 orders of magnitude), which makes the parameter tuning procedure
an exhausting task.
Contrary to their super:cial simplicity, they need far more computations per iteration
(refer to the appendix for their formulations). Thus, they are slower than the sigmoid
output from this aspect as well.
Their recognition errors are higher than that of the sigmoidal activity function.
The afore-said problems worsen on the datasets of groups B and C.
The pseudo-linear activity function has better performance than the linear one,
in terms of convergence speed, recognition rate, and sensitivity to learning
parameters.
Therefore, output cells with a sigmoid activity function are preferred over other activity functions, due to resulting in less sensitivity to learning parameters, faster convergence, and lower recognition error. Although applying the BST algorithm on the
RBF network with linear and pseudo-linear outputs does improve their performances,
they do not surpass the RBF network with sigmoid output trained with the BST
algorithm.
(3) We tried three training paradigms:
Half training: Only the weight matrix of the output layer was under training
Half training: The weight matrix of the output layer and kernel vectors were under
training.
Full training: The weight matrix of the output layer, kernel vectors and spread
parameters were under training.
If the half-training paradigm is chosenconsidering that the kernel vectors will not be
in the optimal positions and the spread parameters will not have the optimum values
we have to increase the number of kernel vectors, otherwise recognition error will
increase. In the case of increasing the number of kernel vectors, both training and
on-line operation will slow down drastically. On the other hand, if the full-training
paradigm is chosen the number of learning parameters increases to three, that is 3
2 and 1 , adjusting which is an exhausting task. All in all, the full-training paradigm
seems to be the most bene:cial method.
(4) We adopted two policies for initializing kernel vectors:
the :rst samples,
the prototype vectors resulting from the k-means algorithm.
56
Table 5
Comparing performance of the RBF network with diEerent initialization policies
BB
BST
Epoch
Error
Train
22
100
21
100
27
100
38
100
26
6
25
9
31
24
27
19
Initial values
Epochs
Error
Test
n; N
Train
Test
14
3
14
6
16
14
13
9
22, 45
100, 120
21, 45
100, 110
27, 45
100, 104
38, 78
100, 110
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Kernel vectors
2
First samples
k-Means cluster Centers
First samples
From k-means
k-Means cluster
Centers
From k-means
Although the last method of initialization does yield faster convergence, nevertheless
the diEerence between these two types of initialization becomes trivial when the number
of kernel vectors grows smaller. More precisely, when the number of kernel vectors
is kept small, using the second initialization method speeds up convergence only at
the very beginning, but in the middle and at the end of training convergence slows
down, and the global convergence is not better than that obtained by the :rst method
(see Table 5). Notwithstanding, before training the RBF network, we need to run the
k-means algorithm to get initial values for the spread parameters, and using the created
cluster centers can be done in no time.
(5) A major drawback of the RBF network lies in its size, and therefore its speed.
Unlike the MLP, in the RBF networks we cannot increase the size of network incrementally. For instance, in our case study the network size was l1 32 32; if we
had decided to enlarge the network, the next size would have been l1 64 32, and
this means that the network size would be doubled. Considering that speed will decrease by an order larger than one, the improved performance would cost a substantial
slow-down both in training and on-line operation, which would make the RBF network
unable to compete with other networks, and therefore be practically useless. Consequently, usage of the RBF network is not recommended if the number of pattern classes
is high.
(6) The best policy for initial values of spread parameters is to set equal initial values
for all of them, but to change them through training. Considering kernel vectors, the
most important aspect is adjusting them during training. In this case, selecting the :rst
samples or the prototype vectors derived from the k-means algorithm yield similar
results (see Table 5).
(7) The data normalization method oEered by Wasserman was tried [20, p. 161].
Wasserman oEers, for each component of the input training vectors:
1. Find the standard deviation over the training set.
57
2. Divide the corresponding component of each training vector by this standard deviation.
Considering that some components of feature vectors are equal for all patterns except for some from speci:c classes, the standard deviations of these components are
very small and by dividing these components to their standard deviations, their values
extremely increase, and this attenuates the impact of other components in the norm
of the diEerence vector xq vm , almost to zero. The mere impact of Wassermans
normalization method was destabilizing the whole system.
(8) Decreasing the learning parameters during selective training, stabilizes the network and speeds up training by preventing repeated overtraining and unlearning on
some patterns.
5. Conclusions
(1) In this paper we presented the BST algorithm, and showed that on the given
datasets, the BST algorithm improves the performance of RBF networks substantially, in terms of convergence speed and recognition error. The BST algorithm
achieves much better results in shorter time. It solves three drawbacks of the backpropagation algorithm: overtraining, slow convergence at the end of training, and
inability to learn the last few percent of patterns. In addition, it has the advantages of shortening training time (up to three times) and partially de-emphasizing
overtrained patterns.
(2) As there is no universally eEective method, the BST algorithm is not an exception and has its own shortcoming. Since the contradictory data or the overlapping
part of the data cannot be learned, applying the selective training on data with a
large overlapping area will destabilize the system. But it is quite eEective when
dataset is error-free and non-overlapping, as it is the case with every error-free
character-recognition database, when enough number of proper features are extracted.
(3) The best training paradigm is full training, because it utilizes all the capacity of
the network. Using the sigmoidal activity function for the neurons of the output
layer is recommended, because it results in less sensitivity to learning parameters,
faster convergence and lower recognition error.
Acknowledgements
This work has been partially supported within the framework of the Slovenian
Iranian Bilateral Scienti:c Cooperation Treaty. The authors would like to thank the
anonymous reviewers for their valuable comments and suggestions which helped to
improve the quality of the paper. We would like to thank Alan McConnell DuE for
linguistic revision.
58
Appendix A. Error gradients of RBF network

Let the network have the con:guration as depicted in Fig. 1, and
x
l1
l2
vm
V
ym
l3
uj
U
zj
tj
Q
input vector (=[x1 ; x2 ; : : : ; xl1 ]T )

dimension of the input vector
number of neurons in hidden layer
prototype vector corresponding to the mth hidden cell (=[v1m ; v2m ; : : : ; vl1 m ]T )
matrix of prototype vectors (=[v1 ; v2 ; : : : ; vl2 ])
output of mth hidden cell
dimension of the output vector
weight vector of the jth output cell (=[u1j ; u2j ; : : : ; ul2 j ]T )
weight matrix of output layer (=[u1 ; u2 ; : : : ; ul3 ])
actual output of the jth output cell
desired output of the jth output cell
number of training patterns
Let TSSE be the cost function de:ned as

q

Eq ; Eq =
(tk zkq )2
TSSE =
q
(q = 1; : : : ; Q)
(A.1)
and let E be the simpli:ed notation for Eq

E=
(tj zj )2 :
(A.2)
We will calculate error gradients for pattern mode training. Obtaining error gradients
for batch mode training is straightforward, as explained in the remark at the end of
this appendix.
We will consider three types of activity functions for output cells
; sigmoid;
1 + eSj
s
1
j
;
linear; with
squashing function;
zj =
(A.3)
l
l
2
2
sj
1
;
pseudo-linear; with
squashing function;

y
m ym
m m
where
sj =
ym umj
(A.4)
and
x vm 2
ym = exp
2m2

:
(A.5)
59
A.1. Part 1Error gradients versus weights of output layer

By using the chain rule for derivatives we get
II
III

@E
@E @zj @sj
=
@umj
@zj @sj @umj
(A.6)
and we will calculate all three terms in three cases.

Computing (I ): The :rst term is the same for all three cases, i.e.
I=
@E
= 2 (tj zj )
@zj
Computing (II ):
l2
@zj
=
II =
1
@sj

m
for cases (13):
(A.7)
for case (1);

(A.8)
ym
for case (2)
and for the third case (sigmoid output) we have

@zj
esj
1 + esj 1
=
=
= zj zj2 = zj (1 zj ):
(1 + esj )2
(1 + esj )2
@sj
II =
(A.9)
Computing (III ): The third term is identical for all cases

III =
@sj
= ym :
@umj
(A.10)
By putting all partial results together we obtain
ym
2(tj zj )
;
case(1);
l2
@E
= 2(tj zj ) ym ;
case(2);
@umj
m ym
2(tj zj ) zj (1 zj )ym ; case(3):
(A.11)
A.2. Part 2Error gradients versus components of prototype vectors

We have
I
II
III

@E @zj @ym
@E
=
:
@vim
@zj @ym @vim
j
(A.12)
Computing (I ): For all three cases we have

I=
@E
= 2(tj zj ):
@zj
(A.13)
60
Computing (II ): For case (1)

zj =
sj
l2
II =
For case (2)

zj =
sj
ym
@zj
umj
=
:
@ym
l2

m ym umj

m ym
since m is dummy variable we can change it to k

sj
k yk ukj
zj =
=
;
y
k
k
k yk

umj k yk k yk ukj
umj k yk sj
@zj

II =
=
=
:
@ym
( k yk )2
( k y k )2
(A.14)
(A.15)
(A.16)
(A.17)
For case (3)

IV

!zj
!zj !sj
=
;
!ym
!sj !ym
(A.18)
considering
zj =
and
sj =
1
;
1 + esj

umj ym ;
(A.19)
(A.20)
we have
IV = zj (1 zj );
(A.21)
V = umj :
(A.22)
and
Then by putting the partial derivatives together we obtain

II =
@zj
= zj (1 zj )umj :
@ym
Computing (III ): For all three cases we have

(x vm )2
x vm 2
ym = exp
= exp
2m2
2m2
(A.23)
(A.24)
then
@ym
= ym
III =
@vim
xim vim
m2
61

:
By putting the partial results together we have

umj ym
2(tj zj )
(xim "im );
l2 m2

umj k yk sj ym
@E

2(tj zj )
(xim "im );
=
2
(
m2
@"im
k yk )
J
ym
2(tj zj )zj (1 zj )umj 2 (xim "im );
m
j
(A.25)
case (1);
case (2);
(A.26)
case (3):
A.3. Part 3Error gradients versus spread parameters

We have
I
II
III

@E @zj @ym
@E
=
;
@m2
@zj @ym @m2
j
(A.27)
terms I and II are exactly as in part 2, therefore we only need to calculate the third
term, which will have an identical formulation in all three cases

x vm 2
(A.28)
ym = exp
2m2
and
III =
x vm 2
@ym
= ym
2
@m
2m4
by putting the partial results together we obtain

umj
x vm 2
2(t
z
)
y
j
j
m
l2
2m4

umj k yk sj
x vm 2
@E

;
2(t
z
)
y
j
j
m
=
( k y k )2
2m4
@m2

x vm 2
2(t
z
)z
(1
z
)u
y
j
j
j
j
mj
m
2m4
j
(A.29)
case(1);
case(2);
(A.30)
case(3):
Remark. All the above formulas have

Qbeen calculated for pattern mode training. For
batch mode training, only the term
q=1 should be added in front of Eqs. (A.11),
(A.26) and (A.30), i.e. all partial results should be summed up. Our experience shows
that batch mode training is much slower than pattern mode training, in addition to its
implementation intricacy and high memory demand.
62
Appendix B. Feature extraction methods

B.1. Group Adb1db5
These datasets were created by extracting the principal components, extracted by
a single layer feedforward linear neural network with generalized Hebbian Training
Algorithm (GHA) [9,7], as summarized in the following:
For extracting principal components, we used m l single layer feedforward linear
network with Generalized Hebbian Training Algorithm (GHA).
db1: To train the network: 88 non-overlapping blocks of the image of every character
were considered as an input vector. The image was scanned from top left to
bottom right and l was set equal to 8. Therefore for every character 72 features
were extracted. The training was performed with 34 samples per character and
the learning rate was set to = 7 103 . To give enough time for the network
to learn the statistics of data, training procedure was repeated for three epochs.
db2: The same as db1, but l was set equal to 6, thus for any character 54 features
were extracted.
db3: The image matrix of any character was converted into a vector, by scanning
vertically from top left to bottom right, then this vector was partitioned into 9
vectors which were inserted into the network as 9 input vectors. In this way, 72
features were extracted for every character.
db4: Similar to db1, but the dimension of the input blocks was considered to be 243,
i.e. every three rows were considered as one input vector. In this way, for any
character 64 features were extracted.
db5: Similar to db4, but l was set equal to 6, thus for any character 48 features were
extracted.
B.2. Group Bdbn1dbn5
These datasets are normalized versions of db1db5. After creating any dataset, the
feature vectors were normalized by mapping the ith component of all the vectors into
the interval [0; 1].
B.3. Group Cdb6 db13
db6: This dataset was created by the zoning method. Each character image was divided into four overlapping squares and the percentage of black pixels in each
square was obtained. The size of overlap was set to two pixels in each edge,
which yields the best recognition rate. The best recognition rate on this dataset
does not exceed 53%, so its features were used only in combination with other
features.
db7: Pixel change coding was used to extract the feature vectors of this dataset.
db8: The feature vectors of this dataset were extracted by vertical and horizontal
projections.
63
db9: The feature vectors of this dataset were extracted by diagonal projection. Ten
components from the beginning and seven components from the end were deleted,
because their values were zero for all characters. The best recognition rate on
this dataset does not reach 85%, so its features were used only in combination
with other features.
db10: By concatenating the feature vectors of db8 and db9, feature vectors of this
dataset were extracted.
db11: The feature vectors of this dataset were created by concatenating the feature
vectors of db6 and db7.
vectors of db6 and db8.
vectors of db11 and some selected features from db8, that is 10 features from
the middle of both vertical and horizontal projections.
References
[1] M.A. Aizerman, E.M. Braverman, I.E. Rozonoer, Theoretical foundations of the potential function
method in pattern recognition learning, Automat. Remote Control 25 (1964) 821837.
[2] S. Amari, N. Murata, K.R. Muller, M. Finke, H.H. Yang, Asymptotic statistical theory of overtraining
and cross-validatio, IEEE Trans. Neural Networks 8 (5) (1997) 985996.
[3] T. Andersen, T. Martinez, Cross validation and MLP architecture selection, Proceedings of International
Joint Conference on Neural Networks, IJCNN99, Cat. No. 99CH36339, Vol. 3 (part 3), 1999,
pp. 1614 1619.
[4] O.A. Bashkirov, E.M. Braverman, I.B. Muchnik, Potential function algorithms networks for pattern
recognition learning machines, Automat. Remote Control 25 (1964) 629631.
[5] D.S. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex
Systems 2 (1988) 321355.
[6] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with application in
pattern recognition, IEEE Trans. Electron. Comput. EC-14 (1965) 326334.
[7] K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley,
New York, 1996.
[8] S.C. Fahlman, An empirical study of learning speed in backpropagation networks, Technical Report
CMU-CS-88-162, Carnegie Mellon University, Pittsburgh, PA 15213, September 1988.
[9] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood CliEs, NJ, USA,
1999.
[10] G. Jondarr, Backpropagation family album, Technical Report, Department of Computing, Macquarie
University, New South Wales, August 1996.
[11] C.G. Looney, Pattern Recognition Using Neural Networks, Oxford University Press, New York, 1997.
[12] W.S. McCulloch, W. Pitts, A logical calculus of the ideas imminent in nervous activity, Bull. Math.
Biophy. 5 (1943) 115133.
[13] D.B. Parker, Learning logic, Invention Report S81-64, File 1, OXce of Technology Licensing, Stanford
University, March 1982.
[14] L. Prechelt, Automatic early stopping using cross validation: quantifying the criteria, Neural Networks
11 (4) (1998) 761767.
[15] D.E. Rumelhart, J.L. McClelland, Parallel Distributed Processing: Exploration in the Microstructure of
Cognition, MIT Press, Cambridge, MA, 1986.
[16] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, USA, 1999.
[17] M.T. Vakil-Baghmisheh, N. Pave*sic, Backpropagation with declining learning rate, Proceeding of the
10th Electrotechnical and Computer Science Conference, Portoro*z, Slovenia, Vol. B, September 2001,
pp. 297300.
64
[18] M.T. Vakil-Baghmisheh, Farsi character recognition using arti:cial neural networks, Ph.D. Thesis,
Faculty of Electrical Engineering, University of Ljubljana, Slovenia, October 2002.
[19] M.T. Vakil-Baghmisheh, N. Pave*si+c, A fast simpli:ed fuzzy ARTMAP network, Neural Process. Lett.,
2003, in press.
[20] P.D. Wasserman, Advanced Methods in Neural Computing, Van Nostrand Reinhold, New York, 1993.
[21] P.J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral science, Ph.D.
Thesis, Harvard University, Cambridge, MA, 1974.
Mohammad-Taghi Vakil-Baghmisheh was born in 1961 in Tabriz, Iran. He received his B.Sc. and M.Sc. degrees in electronics, from Tehran University in 1987
and 1991. In 2002, he received his Ph.D. degree from the University of Ljubljana, Slovenia, with a dissertation on neural networks in the Faculty of Electrical
Engineering.
Nikola Pave0si1c was born in 1946. He received his B.Sc. degree in electronics,
M.Sc. degree in automatics, and Ph.D. degree in electrical engineering from the
University of Ljubljana, Slovenia, in 1970,1973 and 1976, respectively. Since 1970
he has been a staE member at the Faculty of Electrical Engineering in Ljubljana,
where he is currently head of the Laboratory of Arti:cial Perception, Systems and
Cybernetics. His research interests include pattern recognition, neural networks,
image processing, speech processing, and information theory. He is the author and
co-author of more than 100 papers and 3 books addressing several aspects of the
above areas.
Professor Nikola Pave*si+c is a member of IEEE, the Slovenian Association of
Electrical Engineers and Technicians (Meritorious Member), the Slovenian Pattern
Recognition Society, and the Slovenian Society for Medical and Biological Engineers. He is also a member
of the editorial boards of several technical journals.

Training RBF Networks With Selective Backpropagation

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Training RBF Networks With Selective Backpropagation

Uploaded by

Copyright:

Available Formats

Neurocomputing 62 (2004) 39 64

Training RBF networks with selective

c 2003 Elsevier B.V. All rights reserved.

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

Attaining the maximum value in the center (zero distance).

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

3. subclass centers obtained by some clustering or classi:cation method, e.g. k-means

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

vim (n + 1) = vim (n) 2

m2 (n + 1) = m2 (n) 1

where 1 ; 2 ; 3 are learning rate factors in the range [0; 1].

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

j = J J is the predicted class;

where  is a small positive constant, and is called the stability margin.

j = J J is the predicted class;

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

Normalized versions of db1 to db5

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

Fig. 2. A sample set of machine-printed Farsi characters (isolated case).

image normalization [18], and by reducing the sizes of character images to 24 24

1.7; 0.013; 0.0003

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

spread parameters two policies were adopted:

and mi is the average vector of the ith cluster.

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

with both linear and pseudo-linear activity functions, as explained in the

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

k-Means cluster Centers

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

Appendix A. Error gradients of RBF network

input vector (=[x1 ; x2 ; : : : ; xl1 ]T )

Let TSSE be the cost function de:ned as

and let E be the simpli:ed notation for Eq

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

A.1. Part 1Error gradients versus weights of output layer

  

and we will calculate all three terms in three cases.

for cases (13):

for case (1);

for case (2)

and for the third case (sigmoid output) we have

Computing (III ): The third term is identical for all cases

By putting all partial results together we obtain

2(tj zj ) zj (1 zj )ym ; case(3):

A.2. Part 2Error gradients versus components of prototype vectors

  

Computing (I ): For all three cases we have

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

Computing (II ): For case (1)

For case (2)

since m is dummy variable we can change it to k

For case (3)

Then by putting the partial derivatives together we obtain

Computing (III ): For all three cases we have

M.-T. Vakil-Baghmisheh, N. Pave+si,c / Neurocomputing 62 (2004) 39 64

vim (n + 1) = vim (n) 2

m2 (n + 1) = m2 (n) 1

where 1 ; 2 ; 3 are learning rate factors in the range [0; 1].

j = J J is the predicted class;

where is a small positive constant, and is called the stability margin.

j = J J is the predicted class;