
Neural Comput & Applic (1999) 8: 265–272
© 1999 Springer-Verlag London Limited

Injecting Knowledge into the Solution of the Two-Spiral Problem

J.R. Álvarez-Sánchez
Departamento Inteligencia Artificial, UNED, Madrid, Spain

Wieland’s two-spiral problem is often used as a test for comparing the quality of different supervised-learning algorithms and architectures. In this paper, we use this two-spiral problem to illustrate the advantages obtained from using all the additional knowledge about the problem domain in designing the neural net which solves a given problem. The characteristics of the knowledge-based net, with regard to complexity, number of elements, training speed and generalisation quality, make it appreciably better than alternative nets which make no use of this knowledge.

Keywords: Backpropagation; Benchmark; Generalisation; Knowledge based design; Neural networks; Two spirals of Wieland

Correspondence and offprint requests to: J.R. Álvarez-Sánchez, Dpto. Inteligencia Artificial, Facultad de Ciencias-UNED, Senda del Rey, s/n, E-28040 Madrid, Spain. E-mail: jras@dia.uned.es

1. The Two-Spiral Task


The two-spiral problem was originally used to demonstrate that multi-layered structures of neurons with a threshold function could solve non-linearly separable problems, as well as to check the generalisation capabilities of these neuron structures [1,2]. Many authors now include it in the benchmarks for speed and quality of learning for new algorithms and architecture types [3,4]. In this paper, we apply the technique of domain-knowledge based neural net design to this problem [5]. Being a problem of artificial origin, the knowledge which can be injected is of a formal nature.

The task, as it appears in the documentation included in the University of Carnegie Mellon’s collection of public-domain benchmark problems [6], consists of learning to discriminate between two sets of training points which are distributed along two spirals in the plane. These spirals make three complete turns round the origin and around each other.

Fig. 1. Cartesian plane representation of the data used in training a neural net to solve the two-spiral problem. The crosses represent one output class and the circles the other.

The set of training points (input, output) are represented graphically in Fig. 1; the positive output values are marked with a cross and the negative ones with a circle. The values of the coordinates, for the 194 points which form the training set, have been generated using the program provided by Alex Wieland [6]. Points are placed on 32 angles and correspond to three complete turns for each spiral (2·(32·3 + 1) = 194 points), the radius varying from 6.5 to 0.5 on one spiral and from −6.5 to −0.5 on the other. The values of the training set are the pairs of Cartesian coordinates of each point and a value indicating to which spiral they belong.
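For illustration, this construction can be reproduced in a few lines of MATLAB (the tool later used for the simulations in Section 6). The sketch below is a reconstruction from the description above, not Wieland’s original generator; in particular, the radius law 6.5·(104 − i)/104, which matches the stated end points 6.5 and 0.5, is an assumption.

    % Sketch (not Wieland's original program) of the 194-point training set:
    % 97 points per spiral, 32 angles per complete turn, three turns, radius
    % from 6.5 down to 0.5; the second spiral is the point reflection of the first.
    i      = (0:96)';                  % point index along one spiral
    ang    = i * pi / 16;              % 32 angles per complete turn
    rad    = 6.5 * (104 - i) / 104;    % 6.5 at i = 0, 0.5 at i = 96
    x      = [ rad .* sin(ang); -rad .* sin(ang) ];
    y      = [ rad .* cos(ang); -rad .* cos(ang) ];
    labels = [ ones(97,1); -ones(97,1) ];   % desired outputs +1 / -1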
The net is trained on the 194 pairs of input/output (point/class) until it can produce the correct output for each input of the example points. The time required for this training is a measure of the quality of the net, understood as a Problem Solving Method (PSM).
The choice of which output values are to represent the two classes is left up to the experimenter’s judgement. In the documentation for this example, it is suggested that, to compare results, a correctness criterion of 40-20-40 be used for the representation of bipolar outputs across a range of values, assigning each of the top and bottom 40% of the range, respectively, to the two classes, and leaving the central 20% as undefined (incorrect) output. This criterion is chosen to make it easier for neurons with sigmoid output to give a correct response, but is not very precise. In the work reported on in this paper, we have adopted a much more restrictive criterion, equivalent to 10-80-10 with a range of −1 to 1; that is, we consider an output between 0.8 and 1 as an indication of belonging to class ‘+’, an output between −1 and −0.8 as an indication of belonging to class ‘−’ and the intermediate values, those between −0.8 and 0.8, as incorrect.
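In code, this criterion amounts to a dead zone of width 1.6 centred on zero; a minimal sketch (the function name is ours):

    % Sketch of the 10-80-10 correctness criterion over the range [-1, +1]:
    % only the top and bottom 10% of the range count as class responses.
    function ok = correct_10_80_10(z, d)
    % z = net output, d = desired output (+1 or -1); works element-wise
    ok = (abs(z) >= 0.8) & (sign(z) == sign(d));
    end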
Formulated in this way, the problem is a difficult one for classical backpropagation nets, and is impossible to solve using a linear separator. For this reason, it is used to compare how the training time varies for different algorithms and according to the number of weights (or their equivalent) in the net.

The rest of the paper is divided into six sections. The following section summarises the most significant reference results for this problem. In Section 3, an analysis is made of the notion of adding previous knowledge about the solution to the problem, and of the method used to incorporate this knowledge in the design of the net. Based on the information obtained, we then present the design of the net which we use to solve this problem in Section 4. In Section 5, the information about the problem obtained in Section 3 is used to calculate adequate initial values for the weights of the proposed net. In Section 6 the simulation results for this net are presented.
2. Summary of Previous Results

Here, we summarise the main aspects of the results obtained by different authors and contained in the University of Carnegie Mellon’s repository of examples. All the reports of this repository deal with neural nets with feedforward architectures, linear-type units (neurons), non-linear sigmoid transfer and learning through error backpropagation by gradient descent. The main difference between the different reports lies in the variations in the learning algorithm, the layering structure and the number of elements per layer. Table 1 gives a comparison of the results of these reports, together with the results obtained in our work (bottom row), these latter results being detailed in later sections.

The comparison table includes data about the type of learning algorithm: ‘BP’ refers to the standard gradient-descent backpropagation algorithm, ‘quickprop’ refers to an improvement introduced in Fahlman [1], ‘cascade correlation’ refers to an algorithm which generates the structure of the net (in cascade) during the training [7], and ‘resilient propagation’ [3] refers to a modification of the backpropagation algorithm in which the weights are incremented or decremented as a function of the sign of the error gradient.

‘Cascade correlation’ algorithms use a technique in which the weights in hidden layers are frozen, so that the effective number of weights modified in each example of a given epoch is reduced, on average, from a total of n(n+5)/2 weights for n neurons in cascade (one per layer, all of which receive the inputs to the net) to a value which can be estimated as the total number of changes (referred to as ‘cross connections’ in Fahlman and Lebiere [7]) divided by the total number of examples presented in all the epochs (that is, divided by the number of epochs and by the number of examples), and divided by the two modification phases (forwards and backwards). This estimated equivalent number of modifiable weights is the one given in the table (the values in italics) for the two cases of type ‘cascade correlation’ which appear there. To calculate the equivalent values of the weights, we have used the ‘cross connections’ data from Treadgold and Gedeon [4]. In the case of GCS from Fritzke [8], the number of weights was counted by hand in the connections figure.

In the last column of the table, the average number of epochs for the training is indicated and, in some cases, also the corresponding number of epochs for the same problem when the point density is quadrupled (in brackets). The indications of the technique quality are the lowest number of epochs and the lowest number of weights together.

Table 1. Summary of the results for the set of examples available in the C.M.U. repository.

Authors                    | Learn                      | Err.Func.       | Transf.         | Connect.           | N   | #w  | tol | #t  | epoch
Lang and Witbrock [2]      | BP μ = 0.5, λ = 0.001      | SSE             | sigmoid [−1,+1] | 2=5=5=5=1 shortcut | 16  | 138 | –   | 3   | 20000 (64000)
                           | id.                        | cross-entr.     | id.             | id.                | 16  | 138 | –   | 3   | 11000
                           | quickprop                  | argtanh         | id.             | id.                | 16  | 138 | –   | 3   | 7900
Walker, D. (unpublished)   | BP λ = 0.1, μ = 0.7        | cut at 0.15     | sigmoid [−.,+.] | 2-20-10-1          | 31  | 281 | 30% | –   | 13900
                           | id.                        | id.             | id.             | 2-16-8-1           | 25  | 193 | 30% | –   | 300000
Leighton, R. (unpublished) | BP + noise in deriv. 0.001 | argtanh, cut    | sigmoid [−.,+.] | 2=5=5=5=1 shortcut | 16  | 138 | 40% | 3   | 4000
Fahlman and Lebiere [7]    | cascade correlation        | argtanh         | sigmoid [−1,+1] | cascaded, shortcut | 16  | 31  | –   | 100 | 1700 (2262)
Treadgold and Gedeon [4]   | cascor + RPROP             | argtanh         | sigmoid [−.,+.] | cascaded, shortcut | 15  | 101 | 20% | 100 | 2437
Fritzke [8]                | supervised GCS             | class err. 0, 1 | RBF adapt.      | 2-145-1            | 146 | 530 | –   | 1   | 180
This paper, Section 6      | BP modif. (MATLAB)         | SSE             | sin [−1,+1]     | 2-(2)-1            | 3   | 3   | 10% | 100 | 96 (102)

BP = Backpropagation, λ = learning rate, μ = momentum coefficient, cascor = cascade correlation, RPROP = Resilient Backpropagation, GCS = Growing Cell Structures, shortcut = neurons in a layer receive inputs from all previous layers, SSE = Sum Squared Error. N = number of neurons, #w = number of adjustable weights, tol = error tolerance (in % of range) for correct classification, #t = number of tests.

To illustrate the type of generalisation usually obtained in the results, Fig. 2 shows the response of a net, in the region of the plane used in the training, obtained from simulation using Fahlman’s ‘cascade correlation’ algorithm [9]. In this figure, the training examples shown in Fig. 1 have been superimposed to show that the results coincide for practically all the points. Note, however, that the black and white zones do not divide the plane into separate clearly-defined spirals. Better generalisation results have been obtained with some other architectures, for example, by Fritzke [8] or by Augusteijn and Steck [10].

Fig. 2. Representation of the response, over the whole plane, of a net (with 15 hidden units) trained using Fahlman’s ‘cascade correlation’ algorithm (in black, the positive zone and in white, the negative zone).

3. Analysis of the Problem

The knowledge we possess about the geometric nature of the discrimination problem which the net has to solve is very useful for deciding which transformations of the input and output data of the net make the learning process more efficient. We first examine a graphical representation of the data to be used in the training, that presented in Fig. 1, which highlights the central symmetry and the periodic character of the alternation between points of the two output classes. This central symmetry suggests the use of polar coordinates. The relation between the distance of the points from the origin (radius) and the angle with respect to the x-axis is shown in Fig. 3. The distribution in (periodic) alternating bands can be readily observed.
The points of a spiral obey the equation r = p·(θ + 2πn) + ro, relating the radius r and the angle θ (with n an integer indicating the number of the cycle); the parameters p and ro define the particular spiral. The geometric interpretation of these parameters on the polar coordinate representation of the data is also shown in Fig. 3.
In this case they correspond to 2πp ≅ 2, and therefore p ≅ 1/π, and ro ≅ 0.5. This information will prove useful for the initialisation of the weights. Note that there is a generalisation implicit in the design, namely that not only does the net assign the correct values to the training examples, but it also divides the whole plane into two spiral regions of opposite polarity, i.e. including the points not given to the net as examples.

Fig. 3. Representation of the radius as a function of the angle (between −π and π) for the training points of the two spirals. The parameters of the spiral, the initial radius, ro, and the step of the spiral, 2πp, are indicated.

4. Selection of Pre-processing and Transference Function

To take advantage of the central symmetry property discussed in the previous section, we pre-process the inputs by transforming them into polar coordinates. The results of this transformation are then used as the terms of a weighted sum (adjustable weights). Due to the periodic alternation of the two classes of points, we use a differentiable transference function of periodic type, such as sine.

The transformation from Cartesian to polar coordinates can be performed in two separate units (without learning) constituting the first layer, which generates the values r = √(x² + y²) and θ = arctan(y/x) + π·sign(y). This latter function for θ, which is expressible in many programming languages as atan2(y,x), gives the angle (between −π and π), but preserving the information of the quadrant, information which is lost with the standard arctangent function. The second (output) layer has only one unit, which generates the output of the net in the form Z = sin(w1θ + w2r + b), where the weights w1 and w2, together with the displacement b, are modifiable through learning.
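With this, the complete net of Fig. 4 can be sketched in a few lines of MATLAB; the function name spiralNet and the packing of the two weights into a vector w are our own choices:

    % Sketch of the two-layer net: fixed polar pre-processing followed by
    % a single sine unit with adjustable weights w = [w1 w2] and bias b.
    function Z = spiralNet(x, y, w, b)
    r     = sqrt(x.^2 + y.^2);              % first-layer unit 1 (no learning)
    theta = atan2(y, x);                    % first-layer unit 2: angle in (-pi, pi]
    Z     = sin(w(1)*theta + w(2)*r + b);   % output unit
    end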
Fig. 4. Scheme of blocks and connections in the solution proposed for 2-dimensional classification problems with central or spiral symmetry.

Since only the output layer has adjustable weights, we can use simple learning algorithms of the perceptron type (though it is also possible to use more elaborate algorithms which modify the weights in proportion to the derivative of the error with respect to the weights). In this case, we can apply, for example, the weight-modification formula Δwi = α·φi·cos(w1θ + w2r + b)·(D − Z), where φ1 is θ, φ2 is r and D is the desired output.
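A minimal sketch of one training epoch with this rule follows (a plain version, without the momentum term and adaptive learning coefficient of the toolbox algorithm used in Section 6; treating the displacement b as a weight with constant input 1 is our assumption, and the function name is ours):

    % One epoch of the perceptron-type rule Delta wi = alpha*phi_i*cos(.)*(D - Z).
    function [w, b] = trainEpoch(theta, r, D, w, b, alpha)
    for k = 1:numel(D)
        a = w(1)*theta(k) + w(2)*r(k) + b;     % argument of the sine unit
        g = alpha * cos(a) * (D(k) - sin(a));  % common factor of the rule
        w(1) = w(1) + g * theta(k);            % phi1 = theta
        w(2) = w(2) + g * r(k);                % phi2 = r
        b    = b + g;                          % bias as weight with input 1 (assumed)
    end
    end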
5. Initialising the Weights

Making good use of the knowledge of the problem domain, we can also calculate approximate values for the weights, and in this way reduce the training time. We have already seen that the relation between the radius r and the angle θ for the spirals is r = p·(θ + 2πn) + ro, with n an integer, where the characteristic constants p and ro define each spiral. We want the net output Z = sin(w1θ + w2r + b) to be equal to 1 for the spiral, so that the argument of the sine function should be w1θ + w2r + b = π/2 + 2πm, for m any integer. Substituting r of the spiral equation in this latter equation and grouping terms, we obtain the expression

(w1 + w2p)θ + w2(2πnp + ro) + b = π/2 + 2πm;  n,m ∈ Z

Since this equation must be valid for any angle, we can separate it into two parts: the θ component and the rest.

θ(w1 + w2p) = 0 ∀θ ⇒ w1 = −w2p
2πnw2p + w2ro + b = π/2 + 2πm;  n,m ∈ Z    (1)

The second part of Eq. (1) can be separated into terms relative to the integers and all other terms. This then leaves us with equations which relate the weights as functions of the values measured on the training data, p* and ro*, which will be approximately
equal to p and ro, the parameters which represent the spiral (though the values p and ro are not unique for each theoretical spiral, if negative values are considered). We assign initial values to the weights w2 = m/(np*); w1 = −w2p* and to the displacement b = π/2 − w2ro*. The values of the integer variables m and n are chosen not to be divisible by 2, so that an odd number of spirals of each class alternate in the solution.
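As a sketch, this initialisation can be written directly from these relations, together with the random choice of the ratio m/n described in Section 6 (variable names are ours):

    % Sketch of the weight initialisation from the parameters measured on the
    % training data, p* ~ 1/pi and ro* ~ 0.5 (Section 3).
    p_star  = 1/pi;
    ro_star = 0.5;
    mn = 1 + 0.2*randn + 2*floor(rand*2);   % ratio m/n, avoiding multiples of 2
    w2 = mn / p_star;                       % from w2 = m/(n p*)
    w1 = -w2 * p_star;                      % from w1 = -w2 p*
    b  = pi/2 - w2 * ro_star;               % from b = pi/2 - w2 ro*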
This type of initialisation relation allows the representative values of the spiral to which the example points belong to be deduced, after the adjustment through learning, the values of p and ro being obtained from the previous formulas by putting n and m equal to 1 (canonical representative).

6. Experimental Results

To confirm the previous theoretical studies, we carried out experiments using the commercial tool for simulating traditional neural nets contained in the MATLAB ‘neural net toolbox’. More concretely, we used the standard algorithm for learning by accelerated backpropagation. This algorithm employs a momentum term in the modification of the weights, and can increase or decrease the learning coefficient in the course of the training if the quadratic error does not diminish. An example of the result of training the net is shown in Fig. 5, where its response in the region of the plane where it was given examples (shown superimposed) is illustrated. Comparing this with the response typical of other nets, such as that shown in Fig. 2, it can be readily observed that it is more generic. Furthermore, the net can be seen to really represent the concept of a spiral, which is implicit in the generalisation assumptions that are part of the objectives, but which only exists from the observer’s point of view and not in the specifications themselves.

Fig. 5. Example of the result after adjustment of the parameters. After 109 epochs, in this case, the values of the weights were (−1.016, −3.294) and the displacement (bias), −2.511. The training examples are superimposed.

To implement the transfer function as a sine function, as required, it was only necessary to write it and its derivative (the cosine) in the form used by the simulation tool. We also needed to write a small program for generating the training examples, the initial values of the weights (as described in the last section), and the calls to the subroutines of the tool for the training and the visualising of the results. Figure 6 shows an example of the response of the net for the initial values of the weights before the training. It can be seen that this first solution already approximates the desired one, though training is needed to tune the response completely.

Fig. 6. Response obtained, before training, with initial values of the weights (−0.7366, −2.143) and displacement (bias) 2.446. The training examples are shown superimposed (only the visible ones coincide correctly). This response is from the same simulation run as the one shown in Fig. 5.

To generate different values of the ratio between the integers m and n, which is used to calculate the initial values of the weights, m/n was assigned the value of the formula ‘1+0.2*randn+2*floor(rand*2)’. This generates a distribution lying between 1 and 5 while avoiding multiples of 2 (or values close to them), giving rise to solutions with interleaved multiple spirals. Such a solution is shown
in Fig. 7, where a response with two interleaved triple spirals was obtained after training and with an initial ratio of m/n close to 3.

Fig. 7. Example of convergence to values which represent interleaved triple spirals, obtained after 229 training epochs, with final weights (−2.995, −9.375) and displacement (bias) 6.083.

For the initial value of the learning coefficient, an approximation was made, ensuring that the coefficient multiplied by the maximum absolute value of the lower-range input (−π < θ < π) and by the maximum accumulable error (twice the number of examples) is smaller than half the value of the average weight. Using this approximation, the initial value of the learning coefficient is less than 1/(2π·2·194) ≈ 4·10⁻⁴. The learning algorithm modifies this value, increasing or decreasing it as necessary, in order to reduce the error. Figure 8 shows an example of the evolution of the sum of the total error and the learning coefficient through a series of training epochs.

Fig. 8. Evolution of the parameters on a 229-epoch simulation run with training. The upper half of the figure shows the variation of the sum of the quadratic error for all the examples (on a logarithmic scale) until a maximum error (dotted line) is reached. The lower half of the figure shows the variation in the learning coefficient on the same simulation run.
The criterion used to terminate the training in the algorithm employed is that the sum of the quadratic error of the examples in an epoch falls below a certain limit. If we assume that the errors follow a normal distribution for all the training examples, then the standard deviation of the error is the square root of the average quadratic error. To ensure that the probability of the error remaining within a given interval is 99.9%, the maximum error (as a percentage of the range) must be 3.5 times the standard deviation. From this condition, the quadratic error sum criterion as a function of the maximum percentage of the error range for each example can be obtained. In our case, we have chosen to consider the training valid if the maximum error is 10% of the range, and therefore a criterion of 194·(0.1·2/3.5)² ≈ 0.63 must be established for the maximum value of the sum of the quadratic error in the examples.
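Both constants are easy to verify numerically; a sketch (variable names are ours):

    % Numeric check of the two constants used above.
    lr0 = 1 / (2*pi * 2 * 194)        % initial learning coefficient, approx. 4.1e-4
    sse = 194 * (0.1 * 2 / 3.5)^2     % stopping criterion for 10% of range, approx. 0.63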
100 training tests were carried out for the net we designed, with different initial values for the weights, according to the ratios deduced in the previous section. An average of 96 epochs was obtained (with deviation 120, minimum of 0 and maximum of 438 epochs). The maximum error in each example after each simulation run was 9.8%, and therefore less than the 10% proposed as the admissible limit.
Simulations were also carried out with a quadruple density of training points. An average of 102 epochs was obtained over 100 runs (with deviation 120, minimum 0 and maximum 543 epochs). The maximum error for each example after each simulation run was 9.6% (in this case the criterion for the maximum value of the sum of the quadratic error was 2.5). With the proposed architecture, there is no substantial increase in the number of training epochs when the number of examples is quadrupled (unlike other architectures). This is due to the capacity of this architecture to directly represent the structure, with no dependence on the number of examples if these all belong to the same spirals.

7. Conclusions

In this article we present the design of a neural net to solve the well-known two-spiral problem.

Fig. 9. Output of the net (Z) in response to the input of the points in the zone of interest of the XY plane. (a) Output of a correctly-trained net in the case of homogeneous, centred, symmetric spirals; (b) output of a net with non-unitary coefficients in the layer implementing the conversion to polar (elliptic) coordinates, concretely: r = √(0.3x² + y²); θ = arctan(−0.2y, x); and weights in the output layer w1 = −2, w2 = 1, b = 0.

In the design phase, we have used knowledge about the problem to improve the response of the net. The net obtained is simpler (smaller number of weights) and needs fewer training epochs than other neural nets commonly used to solve the same problem. We have also verified that the type of response generated by the net designed in this article is of better quality from the point of view of the generalisation of the two-spiral problem.

The architecture proposed and developed in the work reported on here is, apparently, specific to the two-spiral problem. However, this is also the case for the architectures with which it is compared (Section 2). These architectures have also been designed specifically for this problem, selecting the algorithm, the number of neurons, the number of layers (except in the ‘cascade’ algorithms case), the type of activation function, the error function, the initialisation of the weights, etc.

The essential characteristic of a neural net is its parametric nature. Starting from a skeletal model with a multilayered architecture, we select the pre-processing to apply to the inputs, the connectivity, the local function of each neuron and the learning algorithm. The design of this architecture is knowledge injected a priori into the net. The rest, the supervised learning, is responsible for the adjustment of the parameters, thus projecting the generic net and its pre-processing onto the specific net (with already-adjusted parameter values) which best solves the problem.

The net designed here for the two-spiral problem can be easily modified to solve other problems of the same family, i.e. problems involving points with central symmetry and alternation in bands (not necessarily equidistant or homogeneous). This modification consists in also allowing the parameters of the pre-processing layer (conversion to polar coordinates) to be modifiable (non-homogeneous elliptic coordinates, in the case where the coefficients are not equal to 1). This change enables other problems, like the example shown in Fig. 9(b), to be solved. This figure can be compared to Fig. 9(a), in which the typical response for the concrete example of the spirals presented in this paper is shown.
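As a sketch, the modified pre-processing of Fig. 9(b) amounts to replacing the polar conversion by an elliptic one. The function name is ours; the coefficients 0.3 and −0.2, and the output weights w1 = −2, w2 = 1, b = 0, are those quoted in the caption, and in the generalised net the pre-processing coefficients would themselves be adjustable:

    % Sketch of the elliptic pre-processing of Fig. 9(b).
    function Z = ellipticNet(x, y, w, b)
    r     = sqrt(0.3*x.^2 + y.^2);          % non-unitary radial coefficient
    theta = atan2(-0.2*y, x);               % distorted angle
    Z     = sin(w(1)*theta + w(2)*r + b);   % same output unit as before
    end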
Acknowledgements. This work has been partially supported by the project TIC-97-0604 of the Comisión Interministerial de Ciencia y Tecnología (CICYT) of Spain.

References

1. Fahlman SE. Faster-learning variations on back-propagation: an empirical study. In: Touretzky DS, Hinton GE, Sejnowski TJ (eds), Proceedings Connectionist Models Summer School, Morgan Kaufmann, 1988
2. Lang KJ, Witbrock MJ. Learning to tell two spirals apart. In: Touretzky DS, Hinton GE, Sejnowski TJ (eds), Proceedings Connectionist Models Summer School, Morgan Kaufmann, 1988
3. Riedmiller M, Braun H. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. Proceedings IEEE International Conference on Neural Networks, IEEE, San Francisco, 1993
4. Treadgold NK, Gedeon TD. A cascade network algorithm employing progressive RPROP. In: Mira J, Moreno-Díaz R, Cabestany J (eds), Biological and Artificial Computation: From Neuroscience to Technology. Lecture Notes in Computer Science 1240, Springer-Verlag, 1997
5. Russell S, Norvig P. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995; 625–648
6. Wieland A. Two spirals. In: White M (ed), http://www.boltz.cs.cmu.edu/benchmarks/two-spirals.html. CMU Repository of Neural Network Benchmarks, 1988
7. Fahlman SE, Lebiere C. The cascade-correlation learning architecture. In: Touretzky DS (ed), Advances in Neural Information Processing Systems 2, Morgan Kaufmann, 1990
8. Fritzke B. Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks 1994; 7: 1441–1460
9. Fahlman SE. Cascade neural network simulator v1.0 (public domain software). In: White M (ed), http://www.boltz.cs.cmu.edu/software.html#Cascade. CMU Repository of Neural Network Benchmarks, 1988
10. Augusteijn MF, Steck UJ. Supervised adaptive clustering: A hybrid neural network architecture. Neural Computing & Applications 1994; 78–89
