www.elsevier.com/locate/neucom
Abstract
Backpropagation with selective training (BST) is applied to training radial basis function (RBF) networks. It improves the performance of the RBF network substantially, in terms of convergence speed and recognition error. Three drawbacks of the basic backpropagation algorithm, i.e. overtraining, slow convergence at the end of training, and inability to learn the last few percent of patterns, are solved. In addition, it has the advantages of shortening training time (up to 3 times) and de-emphasizing overtrained patterns. The simulation results obtained on 16 datasets of the Farsi optical character recognition problem prove the advantages of the BST algorithm. Three activity functions for output cells are examined, and the sigmoid activity function is preferred over the others, since it results in less sensitivity to learning parameters, faster convergence and lower recognition error.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Neural networks; Radial basis functions; Backpropagation with selective training; Overtraining;
Farsi optical character recognition
1. Introduction
Neural networks (NNs) have been used in a broad range of applications, including: pattern classification, pattern completion, function approximation, optimization, prediction, and automatic control. In many cases, they even outperform their classical
Corresponding author. LUKS, Fakulteta za elektrotehniko, Tržaška 25, 1000 Ljubljana, Slovenia. Tel.: +386-1-4768839; fax: +386-1-4768316.
E-mail addresses: vakil@luks.fe.uni-lj.si (M.-T. Vakil-Baghmisheh), nikola.pavesic@fe.uni-lj.si (N. Pavešić).
[Fig. 1. Configuration of the RBF network: inputs x_i (i = 1, …, l1), hidden cells y_m (m = 1, …, l2), outputs z_j (j = 1, …, l3); for explanation see Appendix A.]
counterparts. In spite of different structures and training paradigms, all NNs perform essentially the same function: vector mapping. Likewise, all NN applications are special cases of vector mapping. Development of detailed mathematical models for NNs began in 1943 with the work of McCulloch et al. [12] and was continued by others.
According to Wasserman [20], the first publication on radial basis functions for classification purposes dates back to 1964 and is attributed to Bashkirof et al. [4] and Aizerman et al. [1]. In 1988, based on Cover's theorem on the separability of patterns [6], Broomhead et al. [5] employed radial basis functions in the design of NNs.
The RBF network is a two-layered network (Fig. 1), and the common method for its training is the backpropagation algorithm. The first version of the backpropagation algorithm, based on the gradient descent method, was proposed by Werbos [21] and Parker [13] independently, but gained popularity after publication of the seminal book by Rumelhart et al. [15]. Since then, many modifications have been offered by others, and Jondarr [10] has reviewed 65 varieties.
Almost all variants of the backpropagation algorithm were originally devised for the multilayer perceptron (MLP). Therefore, any variant of the backpropagation algorithm which is used for training the radial basis function (RBF) network should be customized to suit this network, so it will be somewhat different from the variant suitable for the MLP. Using the backpropagation algorithm for training the RBF network has three main drawbacks:
overtraining, which weakens the network's generalization property,
slowness at the end of training,
inability to learn the last few percent of vector associations.
A solution offered for the overtraining problem is early stopping employing the cross-validation technique [9]. There are plenty of research reports that argue against the usefulness of the cross-validation technique in the design and training of NNs. For detailed discussions the reader is invited to see [2,3,14].
From our point of view, there are two major reasons against using early stopping
and the cross validation technique on our data:
(1) The cross validation stops training on both learned and unlearned data. While
the logic behind early stopping is preventing overtraining on learned data, there
is no logic for stopping the training on unlearned data, when the data is not
contradictory.
(2) In the RBF and the MLP networks, the learning trajectory depends on the randomly selected initial point. This means that the optimal number of training epochs obtained by CV is useful only if we always start training from the same initial point, and the network always traverses the same learning trajectory!
To improve the performance of the network, the authors suggest selective training, as there is no other way to improve the performance of the RBF network on the given datasets. The paper shows that if we use early stopping or continue the training with the whole dataset, the generalization error will be much higher than the results obtained by selective training. In [19] the backpropagation with selective training (BST) algorithm was presented for the first time and was used for training the MLP network.
Based on the results obtained on our OCR datasets, the BST algorithm has the following advantages over the basic backpropagation algorithm:
it prevents overtraining,
it de-emphasizes the overtrained patterns,
it enables the network to learn the last percent of unlearned associations in a short period of time.
As no method is universally effective, the BST algorithm is not an exception. Since contradictory data or the overlapping part of the data cannot be learned, applying selective training to data with a large overlapping area will destabilize the system; but it is quite effective when the dataset is error-free and non-overlapping, as is the case with every error-free character-recognition database, provided that a sufficient number of proper features is extracted.
Organization of the paper: The RBF network is reviewed in Section 2. In Section 3, the training algorithms are presented. Simulation results are presented in Section 4, and conclusions are given in Section 5. In addition, the paper includes two appendices. In most of the resources the formulations for calculating the error gradients of RBF networks are either erroneous and conflicting (for instance see formulas 4.57, 4.60, 7.53, 7.54, 7.55 in [11]), or not given at all (see for instance [20,16]). Thus in Appendix A we derive these formulas for three forms of the output cell activity function. Appendix B presents some information about the feature extraction methods used for creating the Farsi optical character recognition datasets, which are used for simulations in this paper.
Remark. Considering that in the classifier every pattern is represented by its feature vector as the input vector to the classifier, classifying the input vector is equivalent to classifying the corresponding pattern. Frequently in the paper, the vector which is to be classified is referred to as the input pattern, or simply pattern, and vice versa.
2. RBF networks
In this section, the structure, training paradigms and initialization methods of RBF
networks are reviewed.
2.1. Structure
While there are various interpretations of RBF, in this paper we will consider it from the pattern recognition point of view. The main idea is to divide the input space into subclasses, and to assign a prototype vector to every subclass at its center. Then the membership of every input vector in each subclass is measured by a function of its distance from the prototype (or kernel vector), that is f_m(x) = f(‖x − v_m‖). This membership function should have four specifications:
1.
2.
3.
4.
In fact, any differentiable and monotonically decreasing function of ‖x − v_m‖ will fulfill these conditions, but the Gaussian function is the common choice. After obtaining the membership values (or similarity measures) of the input vector in the subclasses, the results should be combined to obtain the membership degrees in every class. The two-layered feed-forward neural network depicted in Fig. 1 is capable of performing all the operations, and is called the RBF network.
The neurons in the hidden layer of the network have a Gaussian activity function and their input-output relationship is

y_m = f_m(x) = exp(−‖x − v_m‖² / (2σ_m²)),   (1)

where v_m is the prototype vector, or the center of the mth subclass, and σ_m is the spread parameter, through which we can control the receptive field of that neuron. The receptive field of the mth neuron is the region in the input space where f_m(x) is high.
The neurons in the output layer could be sigmoid, linear, or pseudo-linear (i.e. linear with some squashing property), so the output could be calculated using one of the following equations:

z_j = 1 / (1 + e^(−s_j)),             sigmoid;
z_j = s_j / l2,                       linear, with 1/l2 squashing;          (2)
z_j = s_j / Σ_{m=1}^{l2} y_m,         pseudo-linear, with 1/Σ_m y_m squashing;

where

s_j = Σ_{m=1}^{l2} y_m u_mj,   j = 1, …, l3.   (3)
Although in most of the literature neurons with a linear or pseudo-linear activity function have been considered for the output layer, we strongly recommend using the sigmoidal activity function, since it results in less sensitivity to learning parameters, faster convergence and lower recognition error.
2.2. Training paradigms
Before starting the training, a cost function should be defined, and through the training process we will try to minimize it. The total sum-squared error (TSSE) is the most popular cost function.
Three paradigms of training have been suggested in the literature:
1. No-training: In this, the simplest, case all the parameters are calculated and fixed in advance and no training is required. This paradigm does not have any practical value, because the number of prototype vectors should be equal to the number of training samples, and consequently the network will be too large and very slow.
2. Half-training: In this case the hidden layer parameters (kernel vectors and spread parameters) are calculated and fixed in advance, and only the connection weights of the output layer are adjusted through the backpropagation algorithm.
3. Full-training: This paradigm requires the training of all parameters, including kernel vectors, spread parameters, and the connection weights of the output layer (the v_m's, σ_m's and u_mj's), through the backpropagation algorithm.
2.3. Initialization methods
The method of initialization of any parameter depends on the selected training paradigm. To determine the initial values of the kernel vectors, many methods have been suggested, among which the most popular are:
1. the first samples of the training set,
2. some randomly chosen samples from the training set,
YU = Z,   Y = [y_1; …; y_Q],   Z = [z_1; …; z_Q],   y_q ∈ R^{l2},   z_q ∈ R^{l3},   (4)

where y_1, …, y_Q and z_1, …, z_Q are the row vectors obtained from the hidden and output layers, respectively, in response to the row vectors x_1, …, x_Q in the input layer, and the equation YU = Z is made as follows: for each input vector x_q in the training set, the outputs of the hidden layer are made a row in the matrix Y, the target outputs are placed in the corresponding row of the target matrix Z, and each set of weights associated with an output neuron is made a column of the matrix U.
Considering that in large-scale problems the dimension of Y is high and (Y^T Y)^{−1} is ill-conditioned, despite the superficial appeal of the pseudo-inverse matrix method, the first, iterative, method is the only applicable one.
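For the half-training paradigm, the pseudo-inverse solution of YU = Z can be sketched with a least-squares solver. This is an illustrative sketch under our naming; as noted above, for large and ill-conditioned Y the iterative (backpropagation) route is preferred:

```python
import numpy as np

def solve_output_weights(Y, Z):
    """Solve Y U = Z for the output-layer weights in the least-squares sense.

    Rows of Y: hidden-layer outputs per training vector (Q x l2);
    rows of Z: corresponding target outputs (Q x l3).
    lstsq is used instead of forming (Y^T Y)^(-1) explicitly, since that
    matrix is often ill-conditioned in large-scale problems.
    """
    U, residuals, rank, sv = np.linalg.lstsq(Y, Z, rcond=None)
    return U
```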
3. Training algorithms
In this section, we will present two training algorithms for the RBF network. First the basic backpropagation (BB) algorithm is reviewed, and then the modified algorithm is presented.
3.1. Basic backpropagation for the RBF network
Here we will consider the algorithm for the full-training paradigm; customizing it for half-training is straightforward and can be done simply by eliminating the gradient calculations and weight updates corresponding to the appropriate parameters.
Algorithm.
(1) Initialize network.
(2) Forward pass: Insert the input and the desired output, compute the network outputs
by proceeding forward through the network, layer by layer.
(3) Backward pass: Calculate the error gradients with respect to the parameters, layer by layer, starting from the output layer and proceeding backwards: ∂E/∂u_mj, ∂E/∂v_im, ∂E/∂σ_m² (see Appendix A).
(4) Update parameters:

u_mj(n + 1) = u_mj(n) − η3 ∂E/∂u_mj,   (5)

v_im(n + 1) = v_im(n) − η2 ∂E/∂v_im,   (6)

σ_m²(n + 1) = σ_m²(n) − η1 ∂E/∂σ_m².   (7)
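One pattern-mode step of the forward pass, backward pass and parameter update, for the sigmoid output case of Appendix A, might look as follows. This is a sketch under our naming (η3, η2, η1 are the learning rates of Eqs. (5)-(7)); it is not the authors' code:

```python
import numpy as np

def bb_update(x, t, V, sigma2, U, eta3, eta2, eta1):
    """One pattern-mode step of basic backpropagation (full-training paradigm)
    for an RBF network with sigmoid outputs; gradients follow Appendix A,
    case (3). Returns the updated (U, V, sigma2)."""
    d = x - V                                  # (l2, l1): x - v_m
    d2 = np.sum(d * d, axis=1)                 # ||x - v_m||^2
    y = np.exp(-d2 / (2.0 * sigma2))           # hidden activities, Eq. (1)
    z = 1.0 / (1.0 + np.exp(-(y @ U)))         # sigmoid outputs, Eq. (2)

    delta = (t - z) * z * (1.0 - z)            # (t_j - z_j) z_j (1 - z_j)
    g_U = -2.0 * np.outer(y, delta)            # dE/du_mj
    back = U @ delta                           # sum_j (t_j - z_j) z_j (1 - z_j) u_mj
    g_V = -2.0 * (y * back / sigma2)[:, None] * d   # dE/dv_im
    g_s2 = -(y * back) * d2 / sigma2 ** 2           # dE/dsigma_m^2

    # gradient-descent updates, Eqs. (5)-(7)
    return U - eta3 * g_U, V - eta2 * g_V, sigma2 - eta1 * g_s2
```

Iterating this step over all training patterns constitutes one epoch of the BB algorithm.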
(4) To get better generalization performance, using the cross-validation [9] method as a stopping criterion has been reported in some cases; this, however, as was mentioned in the introduction, is unsatisfactory and unconvincing, because it stops training on both learned and unlearned inputs.
(5) The number of output cells depends on the number of classes and the coding approach; however, it is highly recommended to make it equal to the number of classes.
(6) Sometimes a constant term (called the threshold term) is also considered in the net input of the sigmoid function or the linear output, implemented using a constant input (equal to 1). In some cases this term triggers the moving target phenomenon and hinders training, and in some other cases without it there is no solution. Therefore, it must be examined for every case separately.
(7) In the rest of this paper, by BB we mean the backpropagation algorithm with sigmoid prime offset as explained in footnote 1, without the momentum term.
3.2. Backpropagation with selective training
The difference between the BST algorithm and the BB algorithm lies in the selective training, which is appended to the BB algorithm. When most of the vector associations have been learned, every input vector should be checked individually; if it is learned, there should be no training on that input, otherwise training is carried out. In doing so, a side effect arises: the stability problem. That is to say, when we continue training on only some inputs, the network usually forgets the other input-output associations which were already learned, and in the next epoch of training it will make wrong predictions for some of the inputs that were already classified correctly.
The solution to this side effect consists of considering a stability margin in the definition of correct classification in the training step. In this way we also carry out training on marginally learned inputs, which are on the verge of being misclassified.
Selective training has its own limitations, and cannot be used on conflicting data, or on a dataset with large overlapping areas of classes. Based on the obtained results, using the BST algorithm on an error-free OCR dataset has the following advantages:
it prevents overtraining,
it de-emphasizes the overtrained patterns,
it enables the network to learn the last percent of unlearned associations in a short period of time.
BST algorithm.
(1) Start training with the BB algorithm, which includes two steps:
forward propagation,
backward propagation.
(2) When the network has learned most of the vector mappings and the training procedure has slowed down, i.e. TSSE has become smaller than a threshold value C, stop the main algorithm and continue with selective training.
(3) For any pattern perform forward propagation and examine the prediction of the network:

z_J = max_j(z_j),   j = 1, …, l3;
prediction is class J   if z_J ≥ z_j + ε for all j ≠ J;
no-prediction           otherwise.   (8)

Training is carried out only on patterns that are not correctly classified with the stability margin ε.
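The learned-pattern test of this step can be sketched as follows, where eps is the stability margin ε. The helper name is ours and reflects our reading of the rule, not the authors' code:

```python
import numpy as np

def needs_training(z, target_class, eps):
    """Selective-training test of the BST algorithm (cf. Eq. (8)).

    A pattern counts as learned only if the output of its target-class neuron
    exceeds every other output by at least the stability margin eps; marginally
    learned or misclassified patterns are (re)trained.
    """
    zj = z[target_class]
    others = np.delete(z, target_class)
    return not np.all(zj >= others + eps)
```

During a selective-training epoch the costly backward pass is then executed only for patterns where this test returns True.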
Remarks. (1) An alternative condition for starting the selective training is as follows: after TSSE becomes smaller than a threshold value C, every T epochs carry out a recognition test on the training set, and if it fulfills a second threshold condition, that is (recognition error ≤ C1), start selective training. The recommended value for T is 3 ≤ T ≤ 5.
(2) If ε is chosen large, training will be continued for most of the learned inputs, and this will make our method ineffective. On the other hand, if ε is chosen too small, during training we will face a stability problem, i.e. with a small change in the weights' values, which happens in every epoch, a considerable number of associations will be forgotten; thus the network will oscillate and training will not proceed. After training, a small change in the feature vector that causes a small change in the output values will change the prediction of the network, or for feature vectors from the same class but with minor differences we will have different predictions. This also causes vulnerability to noise and weak performance on both the test set and real-world data outside both the training set and the test set. The optimum value of ε should be small enough to prevent training on learned inputs, but not so small as to give way to changing the winner neuron with minor changes in weight values or input values. Our simulation results show that for the RBF network a value in the range [0.1, 0.2] is the optimum for our datasets.
(3) It is also possible to consider a no-prediction state in the final test, that is

z_J = max_j(z_j),   j = 1, …, l3;
prediction is class J   if z_J ≥ z_j + ε1 for all j ≠ J;
no-prediction           otherwise,   (9)

in which 0 < ε1. This no-prediction state will decrease the error rate at the cost of decreasing the correct prediction rate.
4. Experiments
In this section, we first give some explanation about the datasets on which the simulations have been carried out. Then the simulation results are presented.
4.1. Datasets
A total of 18 datasets composed of feature vectors of 32 isolated characters of the Farsi alphabet, sorted in three groups, were created through various feature extraction methods, including: principal component analysis (PCA), vertical, horizontal and diagonal projections, zoning, pixel change coding and some combinations of them, with the number of features varying from 4 to 78 per character, according to Table 1.
For creating these datasets, 34 Farsi fonts, which are used in publishing online newspapers and web sites, were downloaded from the Internet. Fig. 2 demonstrates a whole set of isolated characters of one sample font printed as text. Then, 32 isolated characters of these 34 Farsi fonts were printed in an image file; 11 sets of these fonts were boldface and one set was italic. In the next step, by printing the image file and scanning it with various brightness and contrast levels, two additional image files were obtained. Then, using a 65 × 65 pixel window, the character images were separated into images of isolated characters. After applying a shift-radius invariant
Table 1
Brief information about Farsi OCR datasets

Database        Extraction method        No. of features per character
Group A
db1             PCA                      72
db2             PCA                      54
db3             PCA                      72
db4             PCA                      64
db5             PCA                      48
Group B
dbn1 to dbn5    PCA
Group C
db6             Zoning                   4
db7             Pixel change coding      48
db8             Projection               48
db9             Projection               30
db10            Projection               78
db11                                     52
db12                                     52
db13                                     72
Table 2
Comparing the recognition errors of the RBF network obtained by the BB and the BST algorithms
[For each dataset the table lists: the BB results (epochs and train/test recognition errors); the BST results (the epochs n and N at which selective training starts and ends, and the train/test errors); the learning-rate parameters η3, η2, η1 for the first and second phases of training; the threshold value; the initial spread parameter σ²; and the stability margin ε.]
(10)

where σ_i² is the variance of the training patterns of the ith cluster, defined by

σ_i² = (1/68) Σ_{x_q ∈ cluster i} (x_q − m_i)²,   for i = 1, …, 32,   (11)

where m_i is the center of the ith cluster.
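The per-cluster variance computation above can be sketched as follows, with the paper's fixed divisor of 68 generalized to the actual cluster size; the helper name, and the assumption that labels and centers come from a k-means run, are ours:

```python
import numpy as np

def cluster_variances(X, labels, centers):
    """Initial spread parameters as per-cluster variances (cf. Eq. (11)):
    sigma_i^2 is the mean squared distance of a cluster's training vectors
    from its center.

    X       : (Q, l1) training vectors
    labels  : (Q,)    cluster index of each vector (e.g. from k-means)
    centers : (k, l1) cluster centers m_i
    """
    return np.array([
        np.mean(np.sum((X[labels == i] - c) ** 2, axis=1))
        for i, c in enumerate(centers)
    ])
```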
algorithm, then followed by selective training for a maximum of 40 epochs. The obtained results are given under the column BST where the threshold value has not been specified, or where n is equal to 100.
4.2.3. Analysis
(1) Results obtained by the BB algorithm: The recognition errors on datasets dbn4, db13, dbn3, db4, db3 and db5 are lower than those on the others, and db12, db8 and db7 must be considered the worst ones. Data normalization has improved the recognition rate in all cases, excluding dbn5. We notice that the performance on the test sets of db1 and db2 is weaker than that on the training sets of these datasets, and this must be attributed to the inappropriate implementation of the feature extraction method. The learning rates of the kernel vectors and spread parameters, i.e. η2 and η1, are much smaller for the datasets of groups B and C than for the datasets of group A, but the value of η3 (the learning rate for the weight matrix of the output layer) does not change substantially for datasets from different groups. The initial spread parameters for the datasets of group A are much larger than for the datasets of groups B and C.
(2) The results obtained by the BST algorithm: The first notable point is the decreased recognition errors on all datasets. The BST algorithm has achieved much better results in shorter time; especially on db3, db4, db5, dbn3, dbn4, dbn5, db11, and db13 the recognition error has reached zero. For evaluating and ranking these datasets we have two other measures: convergence speed and the number of features. Regarding convergence speed, the best datasets are: db4, db13, db11, dbn4, db3, dbn3, db5, dbn5, although in some cases the differences are too slight to be meaningful; db3, dbn3 and db13 should be ruled out because of the high dimension of their feature vectors. In addition, training does not benefit from data normalization. Also, the BST algorithm solves the overtraining problem. We notice that it has decreased the error rate on the training set, but not at the cost of increased error on the test set. And this is even more obvious from the results obtained on db1 and db2. In addition, this can be verified from the results demonstrated in Table 3.
(3) In Table 3 we have compared the recognition errors at the epochs n and N, i.e. at the beginning and the end of the selective training obtained by the BST algorithm, against the recognition errors at the same epochs obtained by the BB algorithm, on four datasets. It shows that after TSSE reaches the threshold value, if we continue training with basic backpropagation the recognition error either decreases only trivially or even increases (e.g. on the test sets of db4 and db11). Some researchers use the cross-validation technique to find this point and stop training at it, but we oppose applying the cross-validation method to neural network training.
(4) Table 4 shows the results of another experiment performed on db4 with different network settings. The threshold value was set equal to 100, the stability margin equal to 0.1, the learning rate parameters were divided only by two for the second phase of training, and the initial weight matrix was changed. While by epoch number 22 the error rates of both algorithms are equal, i.e. 26 and 14 on the training set and test set, respectively, the BST algorithm has reached zero error on both sets in 23 epochs of selective training, whereas the error rates of the BB algorithm after 100 epochs of unselective training are 6 and 3 on the training set and test set, respectively.
Table 3
Comparing the recognition errors of the BB and the BST algorithms at two points of training

            BB                          BST
Database    Epoch   Error (Train/Test)  Epoch   Error (Train/Test)
db1         65      20 / 29             65      20 / 29
            104     11 / 25             104      0 / 17
db4         40       9 /  4             40       9 /  4
            100      9 /  6             55       0 /  0
db8         78      89 / 41             78      89 / 41
            115     83 / 40             115     14 /  7
db11        48      50 / 26             48      50 / 26
            100     49 / 28             65       0 /  0
Table 4
Comparing the recognition errors of the RBF network on db4 obtained by the BB and the BST algorithms

BB                             BST
Epoch   Error (Train/Test)     Epochs (n, N)   Error (Train/Test)
22      26 / 14                22, 45          0 / 0
100      6 /  3                100, 120        0 / 0
(5) The reader should recall that in the selective training mode, the calculation for weight updating (or backpropagating), which is the most time-consuming step of training, is carried out only for misclassified patterns, whose number at the beginning of selective training is less than 89, or 5% of all training samples (see Tables 2 and 3); thus one epoch of selective training is at least five times faster than one of unselective training, and as the number of misclassified patterns decreases over time it becomes faster and faster. Therefore the BST algorithm is at least three times faster than the BB algorithm.
(6) Fig. 3 demonstrates TSSE versus the epoch number, obtained on db4, corresponding to the experiment of Table 4. On changing the training algorithm we face a sudden descent in TSSE, and this must be attributed to the sharp decrease of the learning rate factors. Our explanation for this phenomenon is as follows:
After approaching the minimum well, by using large values of the learning rate and momentum factor we step over the well. But by decreasing the step size we step into the middle of the minimum well and fall to the bottom. This phenomenon inspired us to devise the BDLRF [17] and BPLRF [19] algorithms. BDLRF and BPLRF
[Fig. 3. Convergence diagrams (TSSE, logarithmic scale, versus epoch number) of the RBF network obtained by the BB and the BST algorithms on db4, corresponding to Table 4.]
are acronyms for backpropagation with declining learning rate factor and backpropagation with plummeting learning rate factor, respectively. In [17,19] we have shown how to speed up training and improve the recognition rate in the MLP by decreasing the learning rate factor. We have also shown that a larger decrease in the values of the training factors can result in a larger decrease of the cost function, and a better recognition rate.
(7) In addition, we notice that during training in the second phase, while the recognition error decreases, TSSE increases, which substantiates our statement that our method does not overtrain the network on learned patterns. On the contrary, if the network has been overtrained on some patterns in the first phase, by increasing TSSE and decreasing the recognition error it is de-emphasizing the already overtrained patterns. In other words, a decreased recognition error on unlearned patterns must be the result of a decreased SSE (sum-squared error) on the same patterns. Thus, for TSSE to increase while the recognition error decreases, there has to be an increase in the SSE of already learned patterns without crossing the stability margin, and this means de-emphasizing overtrained patterns. Therefore, our method decreases the error on the training set, but not at the cost of overtraining and increased error on the test set.
4.2.4. Considerations
(1) By starting from a different initial point, the number of training epochs will change slightly, but not so much as to affect the general conclusions.
(2) As already mentioned, we considered three types of activity functions for the output layer: sigmoid, linear, and pseudo-linear. And we faced numerous problems
Table 5
Comparing performance of the RBF network with different initialization policies

BB                             BST
Epoch   Error (Train/Test)     Epochs (n, N)   Error (Train/Test)
22      26 / 14                22, 45          0 / 0
100      6 /  3                100, 120        0 / 0
21      25 / 14                21, 45          0 / 0
100      9 /  6                100, 110        0 / 0
27      31 / 16                27, 45          0 / 0
100     24 / 14                100, 104        0 / 0
38      27 / 13                38, 78          0 / 0
100     19 /  9                100, 110        0 / 0

[Initial values column: kernel vectors initialized from the first samples or from k-means cluster centers; σ² from k-means.]
Although the last method of initialization does yield faster convergence, the difference between these two types of initialization becomes trivial when the number of kernel vectors grows smaller. More precisely, when the number of kernel vectors is kept small, using the second initialization method speeds up convergence only at the very beginning; in the middle and at the end of training convergence slows down, and the global convergence is no better than that obtained by the first method (see Table 5). Notwithstanding, before training the RBF network we need to run the k-means algorithm anyway to get initial values for the spread parameters, so using the created cluster centers can be done in no time.
(5) A major drawback of the RBF network lies in its size, and therefore its speed. Unlike the MLP, in the RBF network we cannot increase the size of the network incrementally. For instance, in our case study the network size was l1 × 32 × 32; if we had decided to enlarge the network, the next size would have been l1 × 64 × 32, which means the network size would be doubled. Considering that speed decreases by an order larger than one, the improved performance would cost a substantial slow-down both in training and in on-line operation, which would make the RBF network unable to compete with other networks, and therefore practically useless. Consequently, usage of the RBF network is not recommended if the number of pattern classes is high.
(6) The best policy for the initial values of the spread parameters is to set equal initial values for all of them, but to change them through training. Concerning the kernel vectors, the most important aspect is adjusting them during training. In this case, selecting the first samples or the prototype vectors derived from the k-means algorithm yields similar results (see Table 5).
(7) The data normalization method offered by Wasserman was tried [20, p. 161]. Wasserman proposes, for each component of the input training vectors:
1. Find the standard deviation over the training set.
2. Divide the corresponding component of each training vector by this standard deviation.
Considering that some components of the feature vectors are equal for all patterns except for some from specific classes, the standard deviations of these components are very small, and dividing these components by their standard deviations increases their values extremely; this attenuates the impact of the other components in the norm of the difference vector ‖x_q − v_m‖ almost to zero. The sole impact of Wasserman's normalization method was destabilizing the whole system.
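The destabilizing effect can be reproduced in a few lines (a sketch; the function name is ours): dividing a near-constant component by its tiny standard deviation inflates it to the scale of the informative components, so it dominates the distance to the kernel vectors.

```python
import numpy as np

def wasserman_normalize(X):
    """Wasserman's per-component normalization [20, p. 161]: divide every
    feature by its standard deviation over the training set. Components that
    are (almost) constant have a tiny std, so the division blows them up and
    they come to dominate ||x - v_m||."""
    return X / X.std(axis=0)
```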
(8) Decreasing the learning parameters during selective training stabilizes the network and speeds up training by preventing repeated overtraining and unlearning of some patterns.
5. Conclusions
(1) In this paper we presented the BST algorithm, and showed that on the given datasets the BST algorithm improves the performance of RBF networks substantially, in terms of convergence speed and recognition error. The BST algorithm achieves much better results in shorter time. It solves three drawbacks of the backpropagation algorithm: overtraining, slow convergence at the end of training, and inability to learn the last few percent of patterns. In addition, it has the advantages of shortening training time (up to three times) and partially de-emphasizing overtrained patterns.
(2) As no method is universally effective, the BST algorithm is not an exception and has its own shortcoming. Since contradictory data or the overlapping part of the data cannot be learned, applying selective training to data with a large overlapping area will destabilize the system. But it is quite effective when the dataset is error-free and non-overlapping, as is the case with every error-free character-recognition database, provided that a sufficient number of proper features is extracted.
(3) The best training paradigm is full training, because it utilizes all the capacity of the network. Using the sigmoidal activity function for the neurons of the output layer is recommended, because it results in less sensitivity to learning parameters, faster convergence and lower recognition error.
Acknowledgements
This work has been partially supported within the framework of the Slovenian-Iranian Bilateral Scientific Cooperation Treaty. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions which helped to improve the quality of the paper. We would also like to thank Alan McConnell Duff for linguistic revision.
Appendix A

The error function is the sum of squared differences between targets and outputs:

E = Σ_{q=1}^{Q} E^q,   (A.1)

E^q = Σ_{j=1}^{l3} (t_j^q − z_j^q)²,   (q = 1, …, Q).   (A.2)

We will calculate error gradients for pattern mode training; the pattern index q is dropped below. Obtaining error gradients for batch mode training is straightforward, as explained in the remark at the end of this appendix.

We will consider three types of activity functions for output cells:

z_j = s_j / l2,   linear, with squashing factor 1/l2, case (1);
z_j = s_j / Σ_{m=1}^{l2} y_m,   pseudo-linear, with squashing factor 1/Σ_m y_m, case (2);
z_j = 1 / (1 + e^{−s_j}),   sigmoid, case (3),   (A.3)

where

s_j = Σ_{m=1}^{l2} y_m u_mj   (A.4)

and

y_m = exp(−‖x − v_m‖² / (2σ_m²)).   (A.5)

Part 1: gradients with respect to the output weights u_mj. By the chain rule,

∂E/∂u_mj = (∂E/∂z_j) · (∂z_j/∂s_j) · (∂s_j/∂u_mj) = I · II · III.   (A.6)

Computing (I):

I = ∂E/∂z_j = −2(t_j − z_j).   (A.7)

Computing (II):

II = ∂z_j/∂s_j = 1/l2, case (1);   1/Σ_{m=1}^{l2} y_m, case (2);   (A.8)

II = ∂z_j/∂s_j = z_j(1 − z_j), case (3).   (A.9)

Computing (III):

III = ∂s_j/∂u_mj = y_m.   (A.10)

Combining the three terms gives

∂E/∂u_mj = −2(t_j − z_j) y_m / l2,   case (1);
∂E/∂u_mj = −2(t_j − z_j) y_m / Σ_{m=1}^{l2} y_m,   case (2);
∂E/∂u_mj = −2(t_j − z_j) z_j(1 − z_j) y_m,   case (3).   (A.11)

Part 2: gradients with respect to the centers v_m. Since y_m feeds all output cells, the gradient sums over j:

∂E/∂v_im = Σ_{j=1}^{l3} (∂E/∂z_j) · (∂z_j/∂y_m) · (∂y_m/∂v_im) = Σ_j I · II · III.   (A.12)

Term I is as before:

I = ∂E/∂z_j = −2(t_j − z_j).   (A.13)

Computing (II):

II = ∂z_j/∂y_m = u_mj / l2,   case (1);   (A.14)

II = ∂z_j/∂y_m = (u_mj Σ_{k=1}^{l2} y_k − s_j) / (Σ_{k=1}^{l2} y_k)²,   case (2).   (A.15)

For case (3) we decompose further:

∂z_j/∂y_m = (∂z_j/∂s_j) · (∂s_j/∂y_m) = IV · V;   (A.16)

considering

z_j = 1 / (1 + e^{−s_j})   (A.17)

and

s_j = Σ_{m=1}^{l2} u_mj y_m,   (A.18)

we have

IV = z_j(1 − z_j),   (A.19)

V = u_mj,   (A.20)

and therefore

II = ∂z_j/∂y_m = z_j(1 − z_j) u_mj,   case (3).   (A.21)

Computing (III):

III = ∂y_m/∂v_im = y_m (x_i − v_im) / σ_m².   (A.22)

Combining the terms gives, for cases (1) and (2),

∂E/∂v_im = −(2 y_m (x_i − v_im) / (l2 σ_m²)) Σ_{j=1}^{l3} (t_j − z_j) u_mj,   case (1);
∂E/∂v_im = −(2 y_m (x_i − v_im) / σ_m²) Σ_{j=1}^{l3} (t_j − z_j) (u_mj Σ_{k=1}^{l2} y_k − s_j) / (Σ_{k=1}^{l2} y_k)²,   case (2),   (A.23)

and for case (3)

∂E/∂v_im = −(2 y_m (x_i − v_im) / σ_m²) Σ_{j=1}^{l3} (t_j − z_j) z_j(1 − z_j) u_mj.   (A.24)

Part 3: gradients with respect to the widths σ_m². Again

∂E/∂σ_m² = Σ_{j=1}^{l3} I · II · III;   (A.25)

terms I and II are exactly as in part 2, therefore we only need to calculate the third term, which has an identical formulation in all three cases. From

y_m = exp(−‖x − v_m‖² / (2σ_m²))   (A.26)

we obtain

III = ∂y_m/∂σ_m² = y_m ‖x − v_m‖² / (2σ_m⁴).   (A.27)

Combining the terms gives, for cases (1) and (2),

∂E/∂σ_m² = −(y_m ‖x − v_m‖² / (l2 σ_m⁴)) Σ_{j=1}^{l3} (t_j − z_j) u_mj,   case (1);
∂E/∂σ_m² = −(y_m ‖x − v_m‖² / σ_m⁴) Σ_{j=1}^{l3} (t_j − z_j) (u_mj Σ_{k=1}^{l2} y_k − s_j) / (Σ_{k=1}^{l2} y_k)²,   case (2),   (A.28)

and for case (3)

∂E/∂σ_m² = −(y_m ‖x − v_m‖² / σ_m⁴) Σ_{j=1}^{l3} (t_j − z_j) z_j(1 − z_j) u_mj.   (A.29)

Remark. For batch mode training, the gradients above are simply summed over the Q training patterns before the parameters are updated.
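The chain-rule derivation for the sigmoid case, case (3), can be verified numerically. The following sketch (dimensions, seed and variable names are ours, not the paper's) implements the forward pass of (A.3)–(A.5) and checks the analytic gradients of the squared error with respect to the output weights, centers and widths against central finite differences:

```python
import numpy as np

# Illustrative sizes: l1 inputs, l2 RBF cells, l3 outputs.
rng = np.random.default_rng(1)
l1, l2, l3 = 4, 5, 3
x = rng.normal(size=l1)                 # one input pattern
t = rng.uniform(size=l3)                # target vector
v = rng.normal(size=(l1, l2))           # centers v_m (columns)
u = rng.normal(size=(l2, l3))           # hidden-to-output weights u_mj
sig2 = rng.uniform(0.5, 1.5, size=l2)   # widths sigma_m^2

def forward():
    d2 = ((x[:, None] - v) ** 2).sum(axis=0)   # ||x - v_m||^2
    y = np.exp(-d2 / (2.0 * sig2))             # RBF outputs y_m
    z = 1.0 / (1.0 + np.exp(-(y @ u)))         # sigmoid outputs z_j
    return d2, y, z, ((t - z) ** 2).sum()      # sum-of-squares error

d2, y, z, E = forward()
delta = -2.0 * (t - z) * z * (1.0 - z)           # I * II for case (3)
g_u = np.outer(y, delta)                         # dE/du_mj
g_v = (x[:, None] - v) / sig2 * y * (u @ delta)  # dE/dv_im
g_s = d2 / (2.0 * sig2 ** 2) * y * (u @ delta)   # dE/dsigma_m^2

def numgrad(p, eps=1e-6):
    """Central finite differences of the error with respect to array p."""
    g = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        p0 = p[idx]
        p[idx] = p0 + eps
        ep = forward()[3]
        p[idx] = p0 - eps
        em = forward()[3]
        p[idx] = p0
        g[idx] = (ep - em) / (2.0 * eps)
    return g

ok = (np.allclose(g_u, numgrad(u), rtol=1e-4, atol=1e-7) and
      np.allclose(g_v, numgrad(v), rtol=1e-4, atol=1e-7) and
      np.allclose(g_s, numgrad(sig2), rtol=1e-4, atol=1e-7))
print("analytic gradients match finite differences:", ok)
```

The linear and pseudo-linear cases can be checked the same way by replacing the sigmoid line and the `delta` factor with the corresponding case (1) or case (2) expressions.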
db9: The feature vectors of this dataset were extracted by diagonal projection. Ten
components from the beginning and seven components from the end were deleted,
because their values were zero for all characters. The best recognition rate on
this dataset does not reach 85%, so its features were used only in combination
with other features.
db10: The feature vectors of this dataset were created by concatenating the feature
vectors of db8 and db9.
db11: The feature vectors of this dataset were created by concatenating the feature
vectors of db6 and db7.
db12: The feature vectors of this dataset were created by concatenating the feature
vectors of db6 and db8.
db13: The feature vectors of this dataset were created by concatenating the feature
vectors of db11 with some selected features from db8, namely 10 features from
the middle of both the vertical and horizontal projections.
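The constructions above amount to two array operations: deleting components that are zero for all characters (as for db9) and concatenating feature vectors pattern-wise (as for db10–db13). A minimal sketch, with function names assumed for illustration:

```python
import numpy as np

def drop_all_zero_components(F):
    """Delete feature components that are zero for all patterns,
    as was done for db9 (rows = patterns, columns = components)."""
    return F[:, np.any(F != 0.0, axis=0)]

def concat_features(*feature_sets):
    """Build a composite dataset (e.g. db10 from db8 and db9) by
    concatenating the feature vectors of each pattern."""
    return np.concatenate(feature_sets, axis=1)
```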
References
[1] M.A. Aizerman, E.M. Braverman, I.E. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Remote Control 25 (1964) 821–837.
[2] S. Amari, N. Murata, K.-R. Müller, M. Finke, H.H. Yang, Asymptotic statistical theory of overtraining and cross-validation, IEEE Trans. Neural Networks 8 (5) (1997) 985–996.
[3] T. Andersen, T. Martinez, Cross validation and MLP architecture selection, Proceedings of the International Joint Conference on Neural Networks, IJCNN'99, Cat. No. 99CH36339, Vol. 3 (part 3), 1999, pp. 1614–1619.
[4] O.A. Bashkirov, E.M. Braverman, I.B. Muchnik, Potential function algorithms for pattern recognition learning machines, Automat. Remote Control 25 (1964) 629–631.
[5] D.S. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.
[6] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans. Electron. Comput. EC-14 (1965) 326–334.
[7] K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, New York, 1996.
[8] S.E. Fahlman, An empirical study of learning speed in backpropagation networks, Technical Report CMU-CS-88-162, Carnegie Mellon University, Pittsburgh, PA 15213, September 1988.
[9] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, USA, 1999.
[10] G. Jondarr, Backpropagation family album, Technical Report, Department of Computing, Macquarie University, New South Wales, August 1996.
[11] C.G. Looney, Pattern Recognition Using Neural Networks, Oxford University Press, New York, 1997.
[12] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115–133.
[13] D.B. Parker, Learning logic, Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, March 1982.
[14] L. Prechelt, Automatic early stopping using cross validation: quantifying the criteria, Neural Networks 11 (4) (1998) 761–767.
[15] D.E. Rumelhart, J.L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[16] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, USA, 1999.
[17] M.T. Vakil-Baghmisheh, N. Pavešić, Backpropagation with declining learning rate, Proceedings of the 10th Electrotechnical and Computer Science Conference, Portorož, Slovenia, Vol. B, September 2001, pp. 297–300.
[18] M.T. Vakil-Baghmisheh, Farsi character recognition using artificial neural networks, Ph.D. Thesis, Faculty of Electrical Engineering, University of Ljubljana, Slovenia, October 2002.
[19] M.T. Vakil-Baghmisheh, N. Pavešić, A fast simplified fuzzy ARTMAP network, Neural Process. Lett., 2003, in press.
[20] P.D. Wasserman, Advanced Methods in Neural Computing, Van Nostrand Reinhold, New York, 1993.
[21] P.J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Harvard University, Cambridge, MA, 1974.
Mohammad-Taghi Vakil-Baghmisheh was born in 1961 in Tabriz, Iran. He received his B.Sc. and M.Sc. degrees in electronics from Tehran University in 1987 and 1991. In 2002, he received his Ph.D. degree from the Faculty of Electrical Engineering, University of Ljubljana, Slovenia, with a dissertation on neural networks.
Nikola Pavešić was born in 1946. He received his B.Sc. degree in electronics, M.Sc. degree in automatics, and Ph.D. degree in electrical engineering from the University of Ljubljana, Slovenia, in 1970, 1973 and 1976, respectively. Since 1970 he has been a staff member of the Faculty of Electrical Engineering in Ljubljana, where he is currently head of the Laboratory of Artificial Perception, Systems and Cybernetics. His research interests include pattern recognition, neural networks, image processing, speech processing, and information theory. He is the author or co-author of more than 100 papers and 3 books addressing several aspects of the above areas.
Professor Nikola Pavešić is a member of the IEEE, the Slovenian Association of Electrical Engineers and Technicians (Meritorious Member), the Slovenian Pattern Recognition Society, and the Slovenian Society for Medical and Biological Engineers. He is also a member of the editorial boards of several technical journals.