Wieland's two-spiral problem is often used as a test for comparing the quality of different supervised-learning algorithms and architectures. In this paper, we use this two-spiral problem to illustrate the advantages obtained from using all the additional knowledge about the problem domain in designing the neural net which solves a given problem. The characteristics of the knowledge-based net, with regard to complexity, number of elements, training speed and generalisation quality, make it appreciably better than alternative nets which make no use of this knowledge.

Keywords: Backpropagation; Benchmark; Generalisation; Knowledge-based design; Neural networks; Two spirals of Wieland

…collection of public-domain benchmark problems [6], consists of learning to discriminate between two sets of training points which are distributed along two spirals in the plane. These spirals make three complete turns round the origin and around each other. The set of training points (input, output) are represented graphically in Fig. 1; the positive output values are marked with a cross and the negative ones with a circle. The values of the coordinates for the 194 points which form the training set have been generated using the program provided by Alex Wieland [6]. Points are placed on 32 angles and correspond to three complete turns for each spiral (2·(32·3 + 1) = 194 points), the radius varying from
Correspondence and offprint requests to: J.R. Álvarez-Sánchez, Dpto. Inteligencia Artificial, Facultad de Ciencias-UNED, Senda del Rey, s/n, E-28040 Madrid, Spain. E-mail: jras@dia.uned.es

Fig. 1. Cartesian plane representation of the data used in training a neural net to solve the two-spiral problem. The crosses represent one output class and the circles the other.
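As a concrete illustration, the training set just described can be generated with a short script. The sketch below follows the commonly cited form of the CMU generator (97 points per spiral at angular steps of π/16, radius shrinking linearly from 6.5 to 0.5); the function name and exact formulas are our reconstruction from the description above, not code taken verbatim from [6].

```python
import math

def two_spirals():
    """Two-spiral training set: 2*(32*3 + 1) = 194 (x, y, class) triples,
    32 angles per turn, three turns, radius from 6.5 down to 0.5."""
    points = []
    for i in range(97):                    # 32*3 + 1 points per spiral
        angle = i * math.pi / 16.0         # 32 angles per full turn
        radius = 6.5 * (104 - i) / 104.0   # 6.5 at i = 0, 0.5 at i = 96
        x = radius * math.sin(angle)
        y = radius * math.cos(angle)
        points.append((x, y, +1))          # class '+'
        points.append((-x, -y, -1))        # centrally symmetric class '-'
    return points

print(len(two_spirals()))  # 194
```

The second spiral is obtained by central symmetry (negating both coordinates), which is the property exploited later in the design of the net.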
266 J.R. Álvarez-Sánchez
6.5 to 0.5 on one spiral and from −6.5 to −0.5 on the other. The values of the training set are the pairs of Cartesian coordinates of each point and a value indicating to which spiral they belong.

The net is trained on the 194 pairs of input/output (point/class) until it can produce the correct output for each input of the example points. The time required for this training is a measure of the quality of the net, understood as a Problem Solving Method (PSM).

The choice of which output values are to represent the two classes is left up to the experimenter's judgement. In the documentation for this example, it is suggested that, to compare results, a correctness criterion of 40-20-40 be used for the representation of bipolar outputs across a range of values, assigning each of the top and bottom 40% of the range, respectively, to the two classes, and leaving the central 20% as undefined (incorrect) output. This criterion is chosen to make it easier for neurons with sigmoid output to give a correct response, but is not very precise. In the work reported on in this paper, we have adopted a much more restrictive criterion, equivalent to 10-80-10 with a range of −1 to 1; that is, we consider an output between 0.8 and 1 as an indication of belonging to class '+', an output between −1 and −0.8 as an indication of belonging to class '−', and the intermediate values, those between −0.8 and 0.8, as incorrect.

Formulated in this way, the problem is a difficult one for classical backpropagation nets, and is impossible to solve using a linear separator. For this reason, it is used to compare how the training time varies for different algorithms and according to the number of weights (or their equivalent) in the net.

The rest of the paper is divided into six sections. The following section summarises the most significant reference results for this problem. In Section 3, an analysis is made of the notion of adding previous knowledge about the solution to the problem, and of the method used to incorporate this knowledge in the design of the net. Based on the information obtained, we then present the design of the net which we use to solve this problem in Section 4. In Section 5, the information about the problem obtained in Section 3 is used to calculate adequate initial values for the weights of the proposed net. In Section 6 the simulation results for this net are presented.

2. Summary of Previous Results

Here, we summarise the main aspects of the results obtained by different authors and contained in the University of Carnegie Mellon's repository of examples. All the reports of this repository deal with neural nets with feedforward architectures, linear-type units (neurons), non-linear sigmoid transfer and learning through error backpropagation by gradient descent. The main difference between the different reports lies in the variations in the learning algorithm, the layering structure and the number of elements per layer. Table 1 gives a comparison of the results of these reports, together with the results obtained in our work (bottom row), these latter results being detailed in later sections.

The comparison table includes data about the type of learning algorithm: 'BP' refers to the standard gradient-descent backpropagation algorithm, 'quickprop' refers to an improvement introduced in Fahlman [1], 'cascade correlation' refers to an algorithm which generates the structure of the net (in cascade) during the training [7], and 'resilient propagation' [3] refers to a modification of the backpropagation algorithm in which the weights are incremented or decremented as a function of the sign of the error gradient.

'Cascade correlation' algorithms use a technique in which the weights in hidden layers are frozen, so that the effective number of weights modified in each example of a given epoch is reduced, on average, from a total of n(n + 5)/2 weights for n neurons in cascade (one for each layer, all of which receive the inputs to the net) to a value which can be estimated as the total number of changes (referred to as 'cross connections' in Fahlman and Lebiere [7]) divided by the total number of examples presented in all the epochs (that is, divided by the number of epochs and by the number of examples), and divided by the two modification phases (forwards and backwards). This estimated equivalent number of modifiable weights is the one given in the table (the values in italics) for the two cases of type 'cascade correlation' which appear there. To calculate the equivalent values of the weights, we have used the 'cross connections' data from Treadgold and Gedeon [4]. In the case of GCS from Fritzke [8], the number of weights was counted by hand in the connections figure.

In the last column of the table, the average number of epochs for the training is indicated and, in some cases, also the corresponding number of epochs for the same problem when the point density is quadrupled (in brackets). The indications of the technique quality are the lowest number of epochs and the lowest number of weights together.

To illustrate the type of generalisation usually obtained in the results, Fig. 2 shows the response of a net, in the region of the plane used in the
Injecting Knowledge into the Solution of the Two-Spiral Problem 267
Table 1. Summary of the results for the set of examples available in the C.M.U. repository.
BP = Backpropagation, η = learning rate, α = momentum coefficient, cascor = cascade correlation, RPROP = Resilient Backpropagation, GCS = Growing Cell Structures, shortcut = neurons in a layer receive inputs from all previous layers, SSE = Sum Squared Error, N = number of neurons, #w = number of adjustable weights, tol = error tolerance (in % of range) for correct classification, #t = number of tests.
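The equivalent-weight estimate used for the 'cascade correlation' rows of Table 1 is simple arithmetic and can be made explicit. The sketch below is ours (the function names are not from the paper): it encodes the total of n(n + 5)/2 weights for n cascaded neurons quoted in Section 2, and the average number of weights effectively modified, i.e. the total 'cross connections' divided by the number of epochs, the number of examples, and the two modification phases.

```python
def total_cascade_weights(n):
    """Total weights for n neurons in cascade, all receiving the net
    inputs: n*(n + 5)/2, the figure quoted in Section 2."""
    return n * (n + 5) // 2

def equivalent_weights(cross_connections, epochs, examples):
    """Estimated number of weights modified per example and per
    modification phase (forward/backward)."""
    return cross_connections / (epochs * examples * 2)

print(total_cascade_weights(4))             # 18
print(equivalent_weights(38800, 100, 194))  # 1.0
```

The argument values in the last line are illustrative only; the actual 'cross connections' counts come from Treadgold and Gedeon [4].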
the training examples, but it also divides the whole plane into two spiral regions of opposite polarity, i.e. including the points not given to the net as examples.

4. Selection of Pre-processing and Transference Function

To take advantage of the central symmetry property discussed in the previous section, we pre-process the inputs by transforming them into polar coordinates. The results of this transformation are then used as the terms of a weighted sum (adjustable weights). Due to the periodic alternation of the two classes of points, we use a differentiable transference function of periodic type, such as sine.

The transformation from Cartesian to polar coordinates can be performed in two separate units (without learning) constituting the first layer, which generates the values r = √(x² + y²) and θ = arctan(y/x) + π·sign(y). This latter function for θ, which is expressible in many programming languages as atan2(y,x), gives the angle (between −π and π), but preserving the information of the quadrant, information which is lost with the standard arctangent function. The second (output) layer has only one unit, which generates the output of the net in the form Z = sin(w1·θ + w2·r + b), where the weights w1 and w2, together with the displacement b, are modifiable through learning.

Fig. 4. Scheme of blocks and connections in the solution proposed for 2-dimensional classification problems with central or spiral symmetry.

…spiral equation in this latter equation and grouping terms, we obtain the expression

(w1 + w2·p)·θ + w2·(2nπp + ro) + b = π/2 + 2πm;  n, m ∈ Z

Since this equation must be valid for any angle, we can separate it into two parts: the θ component and the rest:

(w1 + w2·p) = 0  ∀θ  ⇒  w1 = −w2·p
2nπ·w2·p + w2·ro + b = π/2 + 2πm,  n, m ∈ Z    (1)

The second part of Eq. (1) can be separated into terms relative to the integers and all other terms. This then leaves us with equations which relate the weights as a function of the values measured on the training data, p* and ro*, which will be approximately equal to p and ro, the parameters which represent the spiral (though the values p and ro are not unique for each theoretical spiral, if negative values are considered). We assign initial values to the weights w2 = m/(n·p*), w1 = −w2·p* and to the displacement b = π/2 − w2·ro*. The values of the integer variables m and n are chosen not to be divisible by 2, so that an odd number of spirals of each class alternate in the solution.

This type of initialisation relation allows the representative values of the spiral to which the example points belong to be deduced after the adjustment through learning, the values of p and ro being obtained from the previous formulas by putting n and m equal to 1 (canonical representative).

6. Experimental Results

To confirm the previous theoretical studies, we carried out experiments using the commercial tool for simulating traditional neural nets contained in the MATLAB 'neural net toolbox'. More concretely, we used the standard algorithm for learning by accelerated backpropagation. This algorithm employs a momentum term in the modification of the weights, and can increase or decrease the learning coefficient in the course of the training if the quadratic error does not diminish. An example of the result of training the net is shown in Fig. 5, where its response in the region of the plane where it was given examples (shown superimposed) is illustrated. Comparing this with the response typical of other nets, such as that shown in Fig. 2, it can be readily observed that it is more generic. Furthermore, the net can be seen to really represent the concept of a spiral, which is implicit in the generalisation assumptions that are part of the objectives, but which only exists from the observer's point of view and not in the specifications themselves.

To implement the transfer function as a sine function, as required, it was only necessary to write it and its derivative (the cosine) in the form used by the simulation tool. We also needed to write a small program for generating the training examples, the initial values of the weights (as described in the last section), and the calls to the subroutines of the tool for the training and the visualising of the results. Figure 6 shows an example of the response of the net for the initial values of the weights before the training. It can be seen that this first solution already approximates the desired one, though training is needed to tune the response completely.

To generate different values of the ratio between the integers m and n, which is used to calculate the initial values of the weights, m/n was assigned to the value of the formula '1 + 0.2∗randn + 2∗floor(rand∗2)'. This generates a distribution lying between 1 and 5 while avoiding multiples of 2 (or values close to this), giving rise to solutions with interleaved multiple spirals. Such a solution is shown in Fig. 7.
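The whole net of Sections 4 and 5 — a fixed polar pre-processing layer, one sine output unit, and the proposed initialisation — fits in a few lines. The code below is our reconstruction, not the authors' MATLAB program: spiral_net implements Z = sin(w1·θ + w2·r + b), initial_weights implements w2 = m/(n·p*), w1 = −w2·p*, b = π/2 − w2·ro*, and classify applies the 10-80-10 criterion of Section 1.

```python
import math

def spiral_net(x, y, w1, w2, b):
    """Forward pass: fixed polar conversion, then one sine unit."""
    r = math.hypot(x, y)        # r = sqrt(x^2 + y^2)
    theta = math.atan2(y, x)    # angle in (-pi, pi], quadrant preserved
    return math.sin(w1 * theta + w2 * r + b)

def initial_weights(p_star, r0_star, m=1, n=1):
    """Initialisation of Section 5; odd m and n give an odd number of
    alternating spiral bands of each class."""
    w2 = m / (n * p_star)
    w1 = -w2 * p_star
    b = math.pi / 2 - w2 * r0_star
    return w1, w2, b

def classify(z):
    """10-80-10 criterion: '+' for z >= 0.8, '-' for z <= -0.8,
    anything in between counts as incorrect."""
    return +1 if z >= 0.8 else (-1 if z <= -0.8 else 0)

# For benchmark points generated as x = r*sin(a), y = r*cos(a) with
# r = 6.5*(104 - i)/104 and a = i*pi/16, one valid parameterisation of
# the spiral in polar form r = p*theta + ro is p = 1/pi, ro = 6 (our fit).
w1, w2, b = initial_weights(1 / math.pi, 6.0)
print(classify(spiral_net(0.0, 6.5, w1, w2, b)))   # 1  (point on spiral '+')
print(classify(spiral_net(0.0, -6.5, w1, w2, b)))  # -1 (its mirror image)
```

Under this assumed fit, the untrained net already classifies the training points correctly, which is consistent with the behaviour reported for Fig. 6; learning then only fine-tunes the response.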
…the probability of the error remaining within a given interval is 99.9%, the maximum error (as a percentage of the range) must be 3.5 times the standard deviation. From this condition, the quadratic error sum criterion as a function of the maximum percentage of the error range for each example can be obtained. In our case, we have chosen to consider a maximum value of the sum of the quadratic error in the examples.

100 training tests were carried out for the net we designed, with different initial values for the weights, according to the ratios deduced in the previous section. An average of 96 epochs was obtained (with deviation 120, minimum of 0 and maximum of 438 epochs). The maximum error in each example after each simulation run was 9.8% and therefore less than the 10% proposed as the admissible limit.

Simulations were also carried out with a quadruple density of training points. An average of 102 epochs was obtained over 100 runs (with deviation 120, minimum 0 and maximum 543 epochs). The maximum error for each example after each simulation run was 9.6% (in this case the criterion for the maximum value of the sum of the quadratic error was 2.5). With the proposed architecture, there is no substantial increase in the number of training epochs when the number of examples is quadrupled (unlike other architectures). This is due to the capacity of this architecture to directly represent the structure, with no dependence on the number of examples if these all belong to the same spirals.

Fig. 7. Example of convergence to values which represent interleaved triple spirals, obtained after 229 training epochs, with final weights (−2.995, −9.375) and displacement (bias) 6.083.

7. Conclusions

In this article we present the design of a neural net to solve the well-known two-spiral problem. In the
Fig. 9. Output of the net (Z) in response to the input of the points in the zone of interest of the XY plane. (a) Output of a correctly-trained net in the case of homogeneous, centred, symmetric spirals; (b) output of a net with non-unitary coefficients in the layer implementing the conversion to polar (elliptic) coordinates, concretely: r = √(0.3x² + y²); θ = arctan(−0.2y, x); and weights in the output layer w1 = −2, w2 = 1, b = 0.
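The elliptic variant of Fig. 9(b) only replaces the fixed polar conversion by a parameterised one; a minimal sketch, using the coefficients quoted in the caption (the function name is ours):

```python
import math

def elliptic_net(x, y, w1=-2.0, w2=1.0, b=0.0):
    """Spiral net with non-unitary pre-processing coefficients
    (values from Fig. 9(b)): elliptic radius and distorted angle."""
    r = math.sqrt(0.3 * x**2 + y**2)
    theta = math.atan2(-0.2 * y, x)
    return math.sin(w1 * theta + w2 * r + b)
```

Making these pre-processing coefficients learnable alongside w1, w2 and b extends the same architecture to centrally symmetric band problems that are not homogeneous.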
design phase, we have used knowledge about the problem to improve the response of the net. The net obtained is simpler (smaller number of weights) and needs fewer training epochs than other neural nets commonly used to solve the same problem. We have also verified that the type of response generated by the net designed in this article is of better quality from the point of view of the generalisation of the two-spiral problem.

The architecture proposed and developed in the work reported on here is, apparently, specific to the two-spiral problem. However, this is also the case for the architectures with which it is compared (Section 2). These architectures have also been designed specifically for this problem, selecting the algorithm, the number of neurons, the number of layers (except in the 'cascade' algorithms case), the type of activation function, the error function, the initialisation of the weights, etc.

The essential characteristic of a neural net is its parametric nature. Starting from a skeletal model with a multilayered architecture, we select the pre-processing to apply to the inputs, the connectivity, the local function of each neuron and the learning algorithm. The design of this architecture is knowledge injected a priori into the net. The rest, the supervised learning, is responsible for the adjustment of the parameters, thus projecting the generic net and its pre-processing onto the specific net (with already-adjusted parameter values) which best solves the problem.

The net designed here for the two-spiral problem can be easily modified to solve other problems of the same family, i.e. problems involving points with central symmetry and alternation in bands (not necessarily equidistant or homogeneous). This modification consists in also allowing the parameters of the pre-processing layer (conversion to polar coordinates) to be modifiable (non-homogeneous elliptic coordinates, in the case where the coefficients are not equal to 1). This change enables other problems, like the example shown in Fig. 9(b), to be solved. This figure can be compared to Fig. 9(a), in which the typical response for the concrete example of the spirals presented in this paper is shown.

Acknowledgements. This work has been partially supported by the project TIC-97-0604 of the Comisión Interministerial de Ciencia y Tecnología (CICYT) of Spain.

References

1. Fahlman SE. Faster-learning variations on back-propagation: an empirical study. In: Touretzky DS, Hinton GE, Sejnowski TJ (eds), Proceedings Connectionist Models Summer School, Morgan Kaufmann, 1988
2. Lang KJ, Witbrock MJ. Learning to tell two spirals apart. In: Touretzky DS, Hinton GE, Sejnowski TJ (eds), Proceedings Connectionist Models Summer School, Morgan Kaufmann, 1988
3. Riedmiller M, Braun H. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. Proceedings IEEE International Conference on Neural Networks, IEEE, San Francisco, 1993
4. Treadgold NK, Gedeon TD. A cascade network algorithm employing progressive RPROP. In: Mira J, Moreno-Díaz R, Cabestany J (eds), Biological and Artificial Computation: From Neuroscience to Technology. Lecture Notes in Computer Science 1240, Springer-Verlag, 1997
5. Russell S, Norvig P. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995; 625–648
6. Wieland A. Two spirals. In: White M (ed), http://www.boltz.cs.cmu.edu/benchmarks/two-spirals.html. CMU Repository of Neural Network Benchmarks, 1988
7. Fahlman SE, Lebiere C. The cascade-correlation learning architecture. In: Touretzky DS (ed), Advances in Neural Information Processing Systems 2, Morgan Kaufmann, 1990
8. Fritzke B. Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks 1994; 7: 1441–1460
9. Fahlman SE. Cascade neural network simulator v1.0 (public domain software). In: White M (ed), http://www.boltz.cs.cmu.edu/software.html#Cascade. CMU Repository of Neural Network Benchmarks, 1988
10. Augusteijn MF, Steck UJ. Supervised adaptive clustering: a hybrid neural network architecture. Neural Computing & Applications 1994; 78–89