
AI: Neural Networks for Beginners

Introduction
This article is Part 1 of a series of 3 articles that I am going to post. The proposed article
content will be as follows:
1. Part 1: This one, will be an introduction into Perceptron networks (single layer
neural networks)
2. Part 2: Will be about multi-layer neural networks, and the back propagation
training method to solve a non-linear classification problem such as the logic of
an XOR logic gate. This is something that a Perceptron can't do. This is explained
further within this article.
3. Part 3: Will be about how to use a genetic algorithm (GA) to train a multi-layer
neural network to solve some logic problem.

Let's start with some biology


Nerve cells in the brain are called neurons. There are an estimated 10^10 to 10^13
neurons in the human brain. Each neuron can make contact with several thousand other
neurons. Neurons are the unit which the brain uses to process information.

So what does a neuron look like


A neuron consists of a cell body, with various extensions from it. Most of these are
branches called dendrites. There is one much longer process (possibly also branching)
called the axon. The dashed line shows the axon hillock, where transmission of signals
starts.
The following diagram illustrates this.

Figure 1 Neuron
The boundary of the neuron is known as the cell membrane. There is a voltage difference
(the membrane potential) between the inside and outside of the membrane.
If the input is large enough, an action potential is then generated. The action potential
(neuronal spike) then travels down the axon, away from the cell body.

Figure 2 Neuron Spiking

Synapses
The connections between one neuron and another are called synapses. Information
always leaves a neuron via its axon (see Figure 1 above), and is then transmitted across a
synapse to the receiving neuron.

Neuron Firing
Neurons only fire when the input is bigger than some threshold. It should, however, be
noted that the firing doesn't get bigger as the stimulus increases; it's an all-or-nothing
arrangement.

Figure 3 Neuron Firing


Spikes (signals) are important, since other neurons receive them. Neurons communicate
with spikes. The information sent is coded by spikes.
The input to a Neuron
Synapses can be excitatory or inhibitory.
Spikes (signals) arriving at an excitatory synapse tend to cause the receiving neuron to
fire. Spikes (signals) arriving at an inhibitory synapse tend to inhibit the receiving neuron
from firing.
The cell body and synapses essentially compute (by a complicated chemical/electrical
process) the difference between the incoming excitatory and inhibitory inputs (spatial and
temporal summation).
When this difference is large enough (compared to the neuron's threshold) then the
neuron will fire.
Roughly speaking, the faster excitatory spikes arrive at its synapses the faster it will fire
(similarly for inhibitory spikes).

So how about artificial neural networks

Suppose that we have a firing rate at each neuron. Also suppose that a neuron connects
with m other neurons and so receives m-many inputs (x1, ..., xm). We could imagine
this configuration looking something like:

Figure 4 Artificial Neuron configuration


This configuration is actually called a Perceptron. The perceptron (an invention of
Rosenblatt [1962]) was one of the earliest neural network models. A perceptron models a
neuron by taking a weighted sum of its inputs and sending the output 1 if the sum is
greater than some adjustable threshold value (otherwise it sends 0 - this is the
all-or-nothing spiking described in the biology; see the Neuron Firing section above).
The function that makes this decision is called an activation function.
The inputs (x1,x2,x3..xm) and connection weights (w1,w2,w3..wm) in Figure 4 are
typically real values, both positive (+) and negative (-). If the feature of some xi tends to
cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the
perceptron, the weight wi will be negative.
The perceptron itself consists of the weights, the summation processor, an activation
function, and an adjustable threshold processor (called the bias hereafter).
For convenience, the normal practice is to treat the bias as just another input. The
following diagram illustrates the revised configuration.

Figure 5 Artificial Neuron configuration, with bias as additional input


The bias can be thought of as the propensity (a tendency towards a particular way of
behaving) of the perceptron to fire irrespective of its inputs. The perceptron configuration
network shown in Figure 5 fires if the weighted sum > 0, or, if you're into math-type
explanations:

output = 1 if sum(wi * xi) > 0, otherwise 0 (where x0 = 1 is the bias input)

Activation Function
The activation usually uses one of the following functions.
Sigmoid Function
The stronger the input, the faster the neuron fires (the higher the firing rate). The
sigmoid is also very useful in multi-layer networks, as the sigmoid curve allows for
differentiation (which is required in the Back Propagation training of multi-layer
networks).

Or, if you're into math-type explanations, the sigmoid function is:

f(x) = 1 / (1 + e^(-x))

Step Function
A basic on/off type function: if x < 0 then 0, else if x >= 0 then 1.

Or, if you're into math-type explanations:

f(x) = 0 if x < 0; f(x) = 1 if x >= 0
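To make these two functions concrete, here is a minimal C# sketch of both (an illustration only; the article's downloadable code may organise them differently):

using System;

static class Activation
{
    // Sigmoid: smooth and differentiable; stronger input gives output
    // closer to 1 (a higher "firing rate")
    public static double Sigmoid(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    // Step: the hard all-or-nothing threshold described above
    public static double Step(double x)
    {
        return x < 0 ? 0.0 : 1.0;
    }
}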

Learning

A foreword on learning
Before we carry on to talk about perceptron learning, let's consider a real-world example:
How do you teach a child to recognize a chair? You show him examples, telling him,
"This is a chair. That is not a chair," until the child learns the concept of what a chair is.
In this stage, the child can look at the examples we have shown him and answer correctly
when asked, "Is this object a chair?"
Furthermore, if we show to the child new objects that he hasn't seen before, we could
expect him to recognize correctly whether the new object is a chair or not, providing that
we've given him enough positive and negative examples.
This is exactly the idea behind the perceptron.
Learning in perceptrons
Learning is the process of modifying the weights and the bias. A perceptron computes a
binary function of its input. Whatever a perceptron can compute, it can learn to compute.
"The perceptron is a program that learn concepts, i.e. it can learn to respond with True
(1) or False (0) for inputs we present to it, by repeatedly "studying" examples presented
to it.
The Perceptron is a single layer neural network whose weights and biases could be
trained to produce a correct target vector when presented with the corresponding input
vector. The training technique used is called the perceptron learning rule. The perceptron
generated great interest due to its ability to generalize from its training vectors and work
with randomly distributed connections. Perceptrons are especially suited for simple
problems in pattern classification."
Professor Jianfeng Feng, Centre for Scientific Computing, Warwick University, England.
The Learning Rule
The perceptron is trained to respond to each input vector with a corresponding target
output of either 0 or 1. The learning rule has been proven to converge on a solution in
finite time if a solution exists.
The learning rule can be summarized in the following two equations:
b=b+[T-A]
For all inputs i:
W(i) = W(i) + [ T - A ] * P(i)

Where W is the vector of weights, P is the input vector presented to the network, T is the
correct result that the neuron should have shown, A is the actual output of the neuron, and
b is the bias.
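For example (illustrative numbers, not from the article): suppose the target was T = 1 but
the perceptron actually output A = 0 for the input vector P = (1, 0). Then [T - A] = 1, so
the bias becomes b + 1, W(1) becomes W(1) + 1 * 1, and W(2) is unchanged (since P(2) = 0).
The weights move in the direction that makes the perceptron more likely to fire for this
input.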
Training
Vectors from a training set are presented to the network one after another.
If the network's output is correct, no change is made.
Otherwise, the weights and biases are updated using the perceptron learning rule (as
shown above). When an entire epoch (a full pass through all of the input training vectors)
has occurred without error, training is complete.
At this time any input training vector may be presented to the network and it will respond
with the correct output vector. If a vector, P, not in the training set is presented to the
network, the network will tend to exhibit generalization by responding with an output
similar to target vectors for input vectors close to the previously unseen input vector P.
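Putting the learning rule and the training loop together, a minimal C# sketch might look
like the following. It trains a two-input perceptron on the AND gate; the variable names
and the choice of the AND gate are my own illustration, not the article's downloadable
code:

using System;

class PerceptronSketch
{
    static void Main()
    {
        // AND gate training set: inputs and target outputs
        double[,] inputs = { { 0, 0 }, { 0, 1 }, { 1, 0 }, { 1, 1 } };
        double[] targets = { 0, 0, 0, 1 };

        double[] w = { 0.0, 0.0 };
        double b = 0.0;

        bool errorFree = false;
        while (!errorFree) // keep running epochs until a clean pass
        {
            errorFree = true;
            for (int p = 0; p < targets.Length; p++)
            {
                // weighted sum plus bias, fed through the step function
                double sum = b + w[0] * inputs[p, 0] + w[1] * inputs[p, 1];
                double actual = sum > 0 ? 1.0 : 0.0;

                double delta = targets[p] - actual; // [T - A]
                if (delta != 0)
                {
                    errorFree = false;
                    b += delta;                   // b = b + [T - A]
                    w[0] += delta * inputs[p, 0]; // W(i) = W(i) + [T - A] * P(i)
                    w[1] += delta * inputs[p, 1];
                }
            }
        }
        Console.WriteLine($"w1={w[0]}, w2={w[1]}, b={b}");
    }
}

Because AND is linearly separable, the rule is guaranteed to converge here in a handful of
epochs.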

So what can we do with neural networks


Well, if we are going to stick to using a single layer neural network, the tasks that can be
achieved are different from those that can be achieved by multi-layer neural networks. As
this article is mainly geared towards dealing with single layer networks, let's discuss
those further:
Single layer neural networks
Single-layer neural networks (perceptron networks) are networks in which the output unit
is independent of the others - each weight affects only one output. Using perceptron
networks it is possible to achieve linearly separable classifications like the diagrams
shown below (assuming we have a network with 2 inputs and 1 output):

It can be seen that this is equivalent to the AND / OR logic gates, shown below.

Figure 6 Classification tasks


So that's a simple example of what we could do with one perceptron (single neuron
essentially), but what if we were to chain several perceptrons together? We could build
some quite complex functionality. Basically we would be constructing the equivalent of
an electronic circuit.
Perceptron networks do, however, have limitations. If the vectors are not linearly
separable, learning will never reach a point where all vectors are classified properly. The
most famous example of the perceptron's inability to solve problems with linearly
nonseparable vectors is the boolean XOR problem.
Multi-layer neural networks
With multi-layer neural networks we can solve non-linearly separable problems such as
the XOR problem mentioned above, which is not achievable using single layer
(perceptron) networks. The next part of this article series will show how to do this using
multi-layer neural networks, using the back propagation training method.
Well, that's about it for this article. I hope it's a nice introduction to neural networks. I
will try and publish the other two articles when I have some spare time (in between my
MSc dissertation and other assignments). I want them to be pretty graphical, so it may
take me a while, but I'll get there soon, I promise.

Introduction
This article is part 2 of a series of 3 articles that I am going to post. The proposed article
content will be as follows:
1. Part 1 : Is an introduction into Perceptron networks (single layer neural networks).
2. Part 2 : This one is about multi-layer neural networks, and the back propagation
training method to solve a non-linear classification problem such as the logic of an
XOR logic gate. This is something that a Perceptron can't do. This is explained
further within this article.
3. Part 3 : Will be about how to use a genetic algorithm (GA) to train a multi layer
neural network to solve some logic problem.

Summary
This article will show how to use a multi-layer neural network to solve the XOR logic
problem.

A Brief Recap (From part 1 of 3)


Before we commence with the nitty gritty of this new article, which deals with multi-layer
neural networks, let's just revisit a few key concepts. If you haven't read Part 1, perhaps
you should start there.

Perceptron Configuration ( Single layer network)


The inputs (x1,x2,x3..xm) and connection weights (w1,w2,w3..wm) shown below are
typically real values, both positive (+) and negative (-).
The perceptron itself consists of the weights, the summation processor, an activation
function, and an adjustable threshold processor (called the bias hereafter).
For convenience, the normal practice is to treat the bias as just another input. The
following diagram illustrates the revised configuration.

The bias can be thought of as the propensity (a tendency towards a particular way of
behaving) of the perceptron to fire irrespective of its inputs. The perceptron configuration
network shown above fires if the weighted sum > 0, or, if you're into math-type
explanations:

output = 1 if sum(wi * xi) > 0, otherwise 0

So that's the basic operation of a perceptron. But we now want to build more layers of
these, so let's carry on to the new stuff.

So Now The New Stuff (More layers)


From this point on, anything that is being discussed relates directly to this article's code.
In the summary at the top, the problem we are trying to solve was how to use a multi-layer
neural network to solve the XOR logic problem. So how is this done? Well, it's really
an incremental build on what Part 1 already discussed. So let's march on.
What does the XOR logic problem look like? Well, it looks like the following truth table:

Input 1  Input 2  |  Output
   0        0     |    0
   0        1     |    1
   1        0     |    1
   1        1     |    0

Remember, with a single layer (perceptron) we can't actually achieve the XOR
functionality, as it is not linearly separable. But with a multi-layer network, this is
achievable.

What Does The New Network Look Like


The new network that will solve the XOR problem will look similar to a single layer
network. We are still dealing with inputs / weights / outputs. What is new is the addition
of the hidden layer.

As already explained above, there is one input layer, one hidden layer and one output
layer.
It is by using the inputs and weights that we are able to work out the activation for a
given node. This is easily achieved for the hidden layer as it has direct links to the actual
input layer.

The output layer, however, knows nothing about the input layer as it is not directly
connected to it. So to work out the activation for an output node we need to make use of
the output from the hidden layer nodes, which are used as inputs to the output layer
nodes.
This entire process described above can be thought of as a pass forward from one layer to
the next.
This still works like it did with a single layer network; the activation for any given node
is still worked out as follows:

a = sum(wi * Ii)

where wi is the weight(i), and Ii is the input(i) value.


You see, it's the same old stuff; no demons, smoke, or magic here. It's stuff we've already
covered.
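To make the pass forward concrete, here is a minimal C# sketch of the idea (the array
names and layout are illustrative assumptions; the article's actual NeuralNetwork class
differs in detail):

using System;

// A sketch of the pass forward described above: the hidden layer is
// computed from the inputs, then the output layer is computed from the
// hidden layer's outputs.
static double[] FeedForward(double[] inputs,
                            double[,] inputToHidden,
                            double[,] hiddenToOutput)
{
    int numHidden = inputToHidden.GetLength(1);
    int numOutputs = hiddenToOutput.GetLength(1);

    // Hidden layer: weighted sum of the inputs, squashed by the sigmoid
    double[] hidden = new double[numHidden];
    for (int h = 0; h < numHidden; h++)
    {
        double sum = 0.0;
        for (int i = 0; i < inputs.Length; i++)
            sum += inputs[i] * inputToHidden[i, h];
        hidden[h] = 1.0 / (1.0 + Math.Exp(-sum));
    }

    // Output layer: same calculation, but its "inputs" are the hidden outputs
    double[] outputs = new double[numOutputs];
    for (int o = 0; o < numOutputs; o++)
    {
        double sum = 0.0;
        for (int h = 0; h < numHidden; h++)
            sum += hidden[h] * hiddenToOutput[h, o];
        outputs[o] = 1.0 / (1.0 + Math.Exp(-sum));
    }
    return outputs;
}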
So that's how the network looks/works. So now I guess you want to know how to go
about training it.

Types Of Learning
There are essentially two types of learning that may be applied to a Neural Network:
"Reinforcement" and "Supervised".

Reinforcement
In Reinforcement learning, during training, a set of inputs is presented to the Neural
Network. Say the output is 0.75 when the target was expecting 1.0.
The error (1.0 - 0.75) is used for training ('wrong by 0.25').
What if there are two outputs? Then the total error is summed to give a single number
(typically the sum of squared errors). E.g. "your total error on all outputs is 1.76".
Note that this just tells you how wrong you were, not in which direction you were wrong.
Using this method we may never get a result, or it could be a case of 'Hunt the needle'.
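For instance, a sketch of how several output errors might be summed into that single
number (illustrative only; the names are my own, not the article's code):

static double TotalError(double[] targets, double[] outputs)
{
    double total = 0.0;
    for (int i = 0; i < targets.Length; i++)
    {
        double diff = targets[i] - outputs[i];
        total += diff * diff; // squaring discards the sign, so the
                              // direction of the error is lost
    }
    return total;
}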
NOTE: Part 3 of this series will be using a GA to train a Neural Network, which is
Reinforcement learning. The GA simply does what a GA does, running all the normal GA
phases to select weights for the Neural Network. There is no back propagation of values;
the Neural Network is just good or just bad. As one can imagine, this process takes a lot
more steps to get to the same result.

Supervised
In Supervised Learning the Neural Network is given more information: not just 'how
wrong' it was, but 'in what direction it was wrong'. It's like 'Hunt the needle', but where
you are told 'North a bit', 'West a bit'.
So you get, and use, far more information in Supervised Learning, and this is the normal
form of Neural Network learning algorithm. Back Propagation (which is what this article
uses) is Supervised Learning.

Learning Algorithm
In brief, to train a multi-layer Neural Network, the following steps are carried out:

Start off with random weights (and biases) in the Neural Network
Try one or more members of the training set, and see how badly the output(s) compare to
what they should be (compared to the target output(s))
Jiggle the weights a bit, aimed at getting an improvement on the outputs
Now try with a new lot of the training set, or repeat again, jiggling the weights each time
Keep repeating until you get quite accurate outputs

This is what this article submission uses to solve the XOR problem. This is also called
"Back Propagation" (normally called BP or BackProp).
BackProp allows you to use this error at the output to adjust the weights arriving at the
output layer, but then it also allows you to calculate the effective error one layer back, and
use this to adjust the weights arriving there, and so on, back-propagating errors through
any number of layers.
The trick is the use of a sigmoid as the non-linear transfer function (which was covered in
Part 1). The sigmoid is used as it offers the ability to apply differentiation techniques.

Because the sigmoid is nicely differentiable, it so happens that:

f'(x) = f(x) * (1 - f(x))

which in the context of the article can be written as


delta_outputs[i] = outputs[i] * (1.0 - outputs[i]) * (targets[i] - outputs[i])
It is by using this calculation that the weight changes can be applied back through the
network.

Things To Watch Out For


Valleys: Using the rolled-ball metaphor, the error surface may well contain valleys with
steep sides and a gently sloping floor. Gradient descent tends to waste time swooshing up
and down each side of the valley (think ball!).

So what can we do about this? Well, we add a momentum term that tends to cancel out
the back and forth movements and emphasizes any consistent direction. Then this will go
down such valleys with gentle bottom-slopes much more successfully (faster).
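In symbols, a typical momentum-augmented weight update looks something like the
following (a standard formulation, given here for illustration rather than taken verbatim
from this article's code):

delta_w(t) = learningRate * delta * input + momentum * delta_w(t - 1)

where the momentum coefficient (often somewhere around 0.9) carries over a fraction of
the previous pass's weight change, damping the side-to-side swooshing while reinforcing
any consistent downhill direction.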

Starting The Training
This is probably best demonstrated with a code snippet from the article's actual code:
/// <summary>/// The main
training. The expected target
values are passed in to this/// method as parameters, and the <see
cref="NeuralNetwork">NeuralNetwork</see>/// is then updated with small
weight changes, for this training iteration/// This method also applied
momentum, to ensure that the NeuralNetwork is/// nurtured into
proceeding in the correct direction. We are trying to avoid valleys.///
If you don't know what valleys means, read the articles associated
text/// </summary>/// <param name="target">A double[] array containing
the target value(s)</param>private void train_network(double[] target)
{
//get momentum values (delta values from last pass)
double[]
delta_hidden = new double[nn.NumberOfHidden + 1];
double[] delta_outputs = new double[nn.NumberOfOutputs];
// Get the delta value for the output layer
for (int i = 0; i <
nn.NumberOfOutputs; i++)
{
delta_outputs[i] =
nn.Outputs[i] * (1.0 - nn.Outputs[i]) * (target[i] nn.Outputs[i]);
}
// Get the delta value for the hidden layer
for (int i = 0; i <
nn.NumberOfHidden + 1; i++)
{
double error = 0.0;
for (int j = 0; j < nn.NumberOfOutputs; j++)
{
error += nn.HiddenToOutputWeights[i, j] * delta_outputs[j];
}
delta_hidden[i] = nn.Hidden[i] * (1.0 - nn.Hidden[i]) * error;
}
// Now update the weights between hidden & output layer
for (int
i = 0; i < nn.NumberOfOutputs; i++)
{
for (int j = 0; j < nn.NumberOfHidden + 1; j++)
{

//use momentum (delta values from last pass),


//to ensure moved in correct direction
nn.HiddenToOutputWeights[j, i] += nn.LearningRate * delta_outputs[i] *
nn.Hidden[j];
}
}
// Now update the weights between input & hidden layer
for (int i
= 0; i < nn.NumberOfHidden; i++)
{
for (int j = 0; j < nn.NumberOfInputs + 1; j++)
{
//use momentum (delta values from last pass),
//to ensure moved in correct direction
nn.InputToHiddenWeights[j, i] += nn.LearningRate * delta_hidden[i] *
nn.Inputs[j];
}
}
}

Introduction
This article is part 3 of a series of three articles that I am going to post. The proposed
article content will be as follows:
1. Part 1: Is an introduction into Perceptron networks (single layer neural
networks).
2. Part 2: Is about multi-layer neural networks, and the back propagation
training method to solve a non-linear classification problem such as the logic of
an XOR logic gate. This is something that a Perceptron can't do. This is explained
further within this article.
3. Part 3: This one is about how to use a genetic algorithm (GA) to train a multi-layer
neural network to solve some logic problem. If you have never come across
genetic algorithms, perhaps my other article located here may be a good place to
start to learn the basics.

Summary
This article will show how to use a Microbial Genetic Algorithm to train a multi-layer
neural network to solve the XOR logic problem.

A Brief Recap (From Parts 1 and 2)


Before we commence with the nitty gritty of this new article, which deals with multi-layer
neural networks, let's just revisit a few key concepts. If you haven't read Part 1 or Part 2,
perhaps you should start there.

Part 1: Perceptron Configuration (Single Layer Network)


The inputs (x1,x2,x3..xm) and connection weights (w1,w2,w3..wm) in figure 4 are
typically real values, both positive (+) and negative (-). If the feature of some xi tends to
cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the
perceptron, the weight wi will be negative.
The perceptron itself consists of weights, the summation processor, and an activation
function, and an adjustable threshold processor (called bias hereafter).
For convenience, the normal practice is to treat the bias as just another input. The
following diagram illustrates the revised configuration:

The bias can be thought of as the propensity (a tendency towards a particular way of
behaving) of the perceptron to fire irrespective of its inputs. The perceptron configuration
network shown in Figure 5 fires if the weighted sum > 0, or, if you are into math-type
explanations:

output = 1 if sum(wi * xi) > 0, otherwise 0

Part 2: Multi-Layer Configuration


The multi-layer network that will solve the XOR problem will look similar to a single
layer network. We are still dealing with inputs / weights / outputs. What is new is the
addition of the hidden layer.

As already explained above, there is one input layer, one hidden layer, and one output
layer.
It is by using the inputs and weights that we are able to work out the activation for a
given node. This is easily achieved for the hidden layer as it has direct links to the actual
input layer.
The output layer, however, knows nothing about the input layer as it is not directly
connected to it. So to work out the activation for an output node, we need to make use of
the output from the hidden layer nodes, which are used as inputs to the output layer
nodes.
This entire process described above can be thought of as a pass forward from one layer to
the next.
This still works like it did with a single layer network; the activation for any given node
is still worked out as follows:

a = sum(wi * Ii)

where wi is the weight(i), and Ii is the input(i) value. You see, it's the same old stuff; no
demons, smoke, or magic here. It's stuff we've already covered.
So that's how the network looks. Now I guess you want to know how to go about training
it.

Learning
There are essentially two types of learning that may be applied to a neural network,
which are "Reinforcement" and "Supervised".

Reinforcement
In Reinforcement learning, during training, a set of inputs is presented to the neural
network. Say the output is 0.75 when the target was expecting 1.0. The error (1.0 - 0.75) is
used for training ("wrong by 0.25"). What if there are two outputs? Then the total error is
summed to give a single number (typically the sum of squared errors). E.g., "your total
error on all outputs is 1.76". Note that this just tells you how wrong you were, not in
which direction you were wrong. Using this method, we may never get a result, or it could
be a case of 'Hunt the needle'.
Using a genetic algorithm to train a multi-layer neural network offers a Reinforcement-type
training arrangement, where the mutation is responsible for "jiggling the weights a
bit". This is what this article is all about.

Supervised
In Supervised learning, the neural network is given more information. Not just "how
wrong" it was, but "in what direction it was wrong", like "Hunt the needle", but where
you are told "North a bit", "West a bit". So you get, and use, far more information in
Supervised learning, and this is the normal form of neural network learning algorithm.
This training method is normally conducted using a Back Propagation training method,
which I covered in Part 2, so if this is your first article of these three parts, and the back
propagation method is of particular interest, then you should look there.

So Now the New Stuff


From this point on, anything that is being discussed relates directly to this article's code.
What is the problem we are trying to solve? Well, it's the same as it was for Part 2: the
simple XOR logic problem. In fact, this article's content is really just an incremental
build on knowledge that was covered in Part 1 and Part 2, so let's march on.
For the benefit of those that may have only read this one article, the XOR logic problem
looks like the following truth table:

Input 1  Input 2  |  Output
   0        0     |    0
   0        1     |    1
   1        0     |    1
   1        1     |    0

Remember, with a single layer (perceptron), we can't actually achieve the XOR
functionality as it's not linearly separable. But with a multi-layer network, this is
achievable.
So with this in mind, how are we going to achieve this? Well, we are going to use a
Genetic Algorithm (GA from this point on) to breed a population of neural networks that
will hopefully evolve to provide a solution to the XOR logic problem; that's the basic
idea anyway.
So what does this all look like?

As can be seen from the figure above, what we are going to do is have a GA which will
actually contain a population of neural networks. The idea being that the GA will jiggle
the weights of the neural networks, within the population, in the hope that the jiggling of
the weights will push the neural network population towards a solution to the XOR
problem.

So How Does This Translate Into an Algorithm


The basic operation of the Microbial GA training is as follows:

Pick two genotypes at random.
Compare scores (fitness) to come up with a winner and a loser.
Go along the genotype, and at each locus (point):
  o With some probability, copy from winner to loser (overwrite).
  o With some probability, mutate that locus of the loser.
So only the loser gets changed, which gives a version of Elitism for free; this ensures the
best in breed remains in the population.
That's it. That is the complete algorithm.
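In code, a single tournament of that loop might look something like this sketch (the
population layout, the scoreFitness helper, and the rate constants are illustrative
assumptions, not the article's actual members):

using System;

class MicrobialSketch
{
    const double RECOMBINATION_RATE = 0.5;
    const double MUTATION_RATE = 0.1;
    const double MUTATION_SIZE = 0.5;

    static Random rnd = new Random();

    // One Microbial GA tournament: pick two genotypes, find the winner,
    // and change only the loser (copy + mutate), leaving the best alone.
    static void Tournament(double[][] population,
                           Func<double[], double> scoreFitness)
    {
        // pick two genotypes at random
        int a = rnd.Next(population.Length);
        int b = rnd.Next(population.Length);
        if (a == b) return; // need two distinct genotypes

        // compare scores (fitness) to come up with a winner and loser
        int winner = scoreFitness(population[a]) >= scoreFitness(population[b]) ? a : b;
        int loser = winner == a ? b : a;

        // go along the genotype, at each locus (point)
        for (int locus = 0; locus < population[loser].Length; locus++)
        {
            // with some probability, copy from winner to loser (overwrite)
            if (rnd.NextDouble() < RECOMBINATION_RATE)
                population[loser][locus] = population[winner][locus];

            // with some probability, mutate that locus of the loser
            if (rnd.NextDouble() < MUTATION_RATE)
                population[loser][locus] += (rnd.NextDouble() - 0.5) * MUTATION_SIZE;
        }
    }
}

Note that in this article's code lower error means fitter, so a real scoreFitness would be
built from the getError(..) method shown later.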


But there are some essential issues to be aware of when playing with GAs:
1. The genotype will be different for a different problem domain

2. The fitness function will be different for a different problem domain


These two items must be developed again whenever a new problem is specified. For
example, if we wanted to find a person's favourite pizza toppings, the genotype and
fitness would be different from that which is used for this article's problem domain.
These two essential elements of a GA (for this article problem domain) are specified
below.

1. The Genotype
For this article, the problem domain states that we had a population of neural networks.
So I created a single dimension array of NeuralNetwork objects. This can be seen from
the constructor code within the GA_Trainer_XOR object:
//ANN's
private NeuralNetwork[] networks;

public GA_Trainer_XOR()
{
    networks = new NeuralNetwork[POPULATION];
    //create new ANN objects, random weights applied at start
    for (int i = 0; i <= networks.GetUpperBound(0); i++)
    {
        networks[i] = new NeuralNetwork(2, 2, 1);
        networks[i].Change +=
            new NeuralNetwork.ChangeHandler(GA_Trainer_NN_Change);
    }
}

2. The Fitness Function


Remembering the problem domain description, the following truth table is what we are
trying to achieve:

Input 1  Input 2  |  Output
   0        0     |    0
   0        1     |    1
   1        0     |    1
   1        1     |    0

So how can we tell how fit (how close) the neural network is to this? It is fairly simple
really. What we do is present the entire set of inputs to the neural network, one at a time,
and keep an accumulated error value, which is worked out as follows:
Within the NeuralNetwork class, there is a getError(..) method like this:



public double getError(double[] targets)
{
    //storage for error
    double error = 0.0;
    //this calculation is based on something I read about weight space in
    //Artificial Intelligence - A Modern Approach, 2nd edition, Prentice Hall
    //2003. Stuart Russell, Peter Norvig. Pg 741
    error = Math.Sqrt(Math.Pow((targets[0] - outputs[0]), 2));
    return error;
}
Then in the NN_Trainer_XOR class, there is an evaluate(..) method that accepts an int
value which represents the member of the population to fetch and evaluate (get the
fitness for). This overall fitness is then returned to the GA training method to see which
neural network should be the winner and which neural network should be the loser.
private double evaluate(int popMember)
{
    double error = 0.0;
    //loop through the entire training set
    for (int i = 0; i <= train_set.GetUpperBound(0); i++)
    {
        //forward these new values through network
        //forward weights through ANN
        forwardWeights(popMember, getTrainSet(i));
        double[] targetValues = getTargetValues(getTrainSet(i));
        error += networks[popMember].getError(targetValues);
    }
    //if the Error term is < acceptableNNError value we have found
    //a good configuration of weights for the NeuralNetwork, so tell
    //GA to stop looking
    if (error < acceptableNNError)
    {
        bestConfiguration = popMember;
        foundGoodANN = true;
    }
    //return error
    return error;
}
So how do we know when we have a trained neural network? In this article's code, what I
have done is provide a fixed limit value within the NN_Trainer_XOR class that, when
reached, indicates that the training has yielded a best configured neural network.
If, however, the entire training loop is done and there is still no well-configured neural
network, I simply return the value of the winner (of the last training epoch) as the overall
best configured neural network.
This is shown in the code snippet below; this should be read in conjunction with the
evaluate(..) method shown above:



//check to see if there was a best configuration found, may not have done
//enough training to find a good NeuralNetwork configuration, so will simply
//have to return the WINNER
if (bestConfiguration == -1)
{
    bestConfiguration = WINNER;
}
//return the best Neural network
return networks[bestConfiguration];
