A neural network is a computing model inspired by the mammalian neural system. Neural network models are algorithms for cognitive tasks such as learning and optimization. Artificial neural networks (ANNs) can be used to classify radar and sonar signals, acquire and track targets, analyze intelligence inputs and optimize scarce resources.
Copyright:
Attribution Non-Commercial (BY-NC)
Neural computing is the study of networks of adaptable nodes which, through a process of learning from task examples, store experiential knowledge and make it available for use.

2 What Are Neural Networks?
A computing model, inspired by the mammalian neural system, composed of many simple, highly interconnected processing units.
Neural network models are algorithms for cognitive tasks, such as learning and optimization, which are in a loose sense based on concepts derived from research into the nature of the brain.

3 What Are Neural Networks?
A neural network model is a directed graph with the following properties:
A state variable $n_i$ is associated with each node $i$.
A real-valued weight $w_{ij}$ is associated with each link from node $i$ to node $j$.
A real-valued bias $u_i$ is associated with each node $i$.
A transfer function $f_i(n_j, w_{ij}, u_i)$ is defined for each node $i$, which determines the state of node $i$.

4 What Can ANN Do?
Biological
  Modeling the retina
  Modeling brain disorders (ADD)
Business
  Evaluate probability of oil in geological formation
  Identify and filter promotion and job applicants
  Mine corporate databases for business rules
Financial
  Assessing credit risk
  Identify forgeries
  Interpret handwritten forms
  Predict portfolio and stock values

5 What Can ANN Do?
Manufacturing
  Automated robot control systems
  Control material flow
  Optimize production lines
  Quality inspection
Medical
  Analyze speech in hearing aids
  Diagnose and prescribe treatment by symptoms
  Monitor surgery and recovery
  Read X-rays and CAT/PET scans

6 What Can ANN Do?
Military
  Classify radar and sonar signals
  Target acquisition and tracking
  Analyze intelligence inputs
  Optimizing scarce resources
Signal processing
  Adaptive Noise Canceling
  Zip Code Reader
  Speech Recognition

7 A Brief History
First concepts
  Turing 1936
  McCulloch & Pitts 1943
  Hebb 1949
Early steps 1950s - 1960s
  The perceptron
  ADALINE and MADALINE
  Excessive hype

8 A Brief History
Stunted growth 1969-1981
  Perceptrons by Minsky and Papert
  Continued work
Renewed interest
  The Hopfield model 1982
  Backpropagation rediscovered 1985 (first 1974 by Werbos)
  Radial Basis Functions - Broomhead & Lowe 1988

9 A Quick Word About The Brain

10 The Biological Neuron
Cell Body
Synapse
Dendrites
Axons

11 Computers And The Brain
We do not understand the brain
The ANN model is only loosely based on the brain
The ANN model is metaphoric to the brain

12 Computers vs. Neural Networks
Von-Neumann Machines | Neural Networks
Few strong processors | ~10^11 simple neurons
Serial processing | Parallel processing
Central control | No central control
10^-9 sec. cycle | 10^-3 sec. cycle
Bit data | Voltage data
Not tolerant | Very robust
Fast numeric operations | Slow numeric operations
Slow high operations | Fast high operations
Learning ? | Learning !

13 Building Blocks Of The Model
The processing element
The connections
Learning methods

14 Processing Element Building Block
The basic building block of a neural network is the processing element (or node or unit).
A generalised node embodies these elements:
  inputs (+ bias)
  weights
  transfer function
    combining function
    activation function
  output(s)

15 The function of a single node
The job of a processing element is to receive a number of inputs (either from the external world or from other nodes or from itself) and to distribute a single output (either to the external world or to other nodes).

16 Some Input Functions
Weighted Summation
$net = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + bias$
where $w_i$ is the weight associated with the connection between an input and the processing element

17 Some Input Functions
Multiplication (or Product)
$net = w_1 x_1 \cdot w_2 x_2 \cdots w_n x_n$
similar to the weighted summation but the summation is replaced by the product
Maximum, Minimum, Majority
$net = \max_n (w_n x_n)$
$net = \min_n (w_n x_n)$
$net = 1$ if $\sum_n w_n x_n > 0$, else $-1$

18 Some Activation Functions
Sigmoid: maps an input into a value between zero and one
Linear: no transformation is applied to the outcome of the combining function
Tangent: similar to the sigmoid but the mapping is between -1 and 1
Step: the transfer value equals 1 if the outcome of the combining function is greater than some threshold, otherwise it equals 0

19 Some Activation Functions
20 Closer Look At Transfer Functions Unipolar Sigmoid
Threshold
Bipolar Sigmoid
Sign
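The four transfer functions named on this slide can be written down directly. The following is a minimal sketch; the function names and the default threshold `theta = 0` are illustrative assumptions, not taken from the slides.

```python
import math

def unipolar_sigmoid(net):
    """Logistic sigmoid: squashes net into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def bipolar_sigmoid(net):
    """Bipolar sigmoid: squashes net into the open interval (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def threshold(net, theta=0.0):
    """Step function: 1 if net exceeds the threshold theta (assumed 0), else 0."""
    return 1 if net > theta else 0

def sign(net):
    """Signum as used for perceptrons: 1 if net > 0, else -1."""
    return 1 if net > 0 else -1

for net in (-2.0, 0.0, 2.0):
    print(net, unipolar_sigmoid(net), bipolar_sigmoid(net), threshold(net), sign(net))
```

The two sigmoids are smooth versions of the threshold and sign functions, which is what later makes gradient-based training possible.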

21 The Connections
The connections are the only thing changing in neural networks
Connections may be either inhibitory or excitatory
Connection strengths are expressed by weights

22 The role of the weights
Each input or node is connected to a processing element
Graphically this is represented by an arc
Each arc has a weight. The weight simply determines the influence (or strength) of an input to a processing element
Neuro-computing is concerned with identifying the correct set of weights

23 An example of a single node
Assume a processing element receives 3 inputs: 1, 0.5, 0.3
If the combining function is the weighted summation and the weights are -0.2, 0.04, 2.35, then the result of the combining function is
(1)(-0.2) + (0.5)(0.04) + (0.3)(2.35) = -0.2 + 0.02 + 0.705 = 0.525

24 An example of a single node
If the activation function is linear, f(x) = x, then the output is 0.525

25 An example of a single node
If the activation function is the sigmoid, f(x) = 1 / (1 + exp(-x)), then the output is 1 / (1 + exp(-0.525)) = 0.628

26 Neural Networks Layers
A NN can be constructed using a number of processing elements
Rather than a chaotic construction, it is generally preferable to build neural networks using layers
A neural network will have an input layer, an output layer and, in between, zero, one or more hidden layers

27 Neural Network Layers 2
Depending on where a processing element is placed, it is categorised as an input, hidden or output processing element
Typically, but not necessarily, each processing element in a layer has the same transfer function
A NN with a 4-3-2 configuration is a 2- or 3-layer NN (depending on whether the input layer is counted) with 4 input nodes, 3 hidden nodes and 2 output nodes

28 The Role of the Input Layer
An input processing element receives input from the external world and simply sends the actual input to the processing elements of the next layer

29 The Role of the Hidden Layer
A hidden processing element receives its input from the nodes of the previous layer, and the transformation of the input is sent to the next layer
A hidden layer may be seen as a pre-processor

30 The Role of the Output Layer
An output processing element delivers to the external world the representation of the original input after the transformations have taken place

31 Connectivity Matters
A number of different networks can be constructed; they differ in terms of the connectivity pattern and the number of layers
Networks with no hidden layers are called single-layer networks
Networks with one or more hidden layers are called multi-layer networks
If all connections lead from input to output, the network is called a feed-forward network
If there are connections in the opposite direction, it is called a feedback or recurrent network

32 Artificial Neural Networks Models
[Figure panels: Single layer feedforward; Multi layer feedforward; Recurrent]

33 Calculations of a multi-layer feed-forward neural network
[Figure: network with nodes x1 to x5 and connection weights +1, +1, 1.5, -1, 0.5, +1, +1, 0.5, +1]
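The calculation performed by a layered feed-forward network is the single-node computation (weighted summation, then activation) repeated layer by layer. The sketch below shows this for a 4-3-2 configuration. All weights and biases are hypothetical values chosen for illustration, not the ones in the figure, and the logistic activation is an assumption.

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def node_output(inputs, weights, bias, activation):
    """Weighted summation followed by the activation function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)

def forward(inputs, layers, activation=logistic):
    """Propagate inputs through a list of layers.

    Each layer is a list of (weights, bias) pairs, one pair per node.
    The input layer only passes its values on, so it is not listed.
    """
    signal = list(inputs)
    for layer in layers:
        signal = [node_output(signal, w, b, activation) for w, b in layer]
    return signal

# Hypothetical weights for a 4-3-2 network (not taken from the slides).
hidden = [([0.5, -0.5, 0.25, 0.0], 0.1),
          ([1.0, 1.0, -1.0, 0.5], -0.2),
          ([-0.75, 0.25, 0.5, 1.0], 0.0)]
output = [([1.0, -1.0, 0.5], 0.0),
          ([0.5, 0.5, 0.5], -0.5)]

y = forward([1.0, 0.0, 1.0, 0.0], [hidden, output])
print(y)  # two output values, each strictly between 0 and 1
```

Passing a different `activation` (for example the identity) reuses the same loop for linear nodes.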

34 Learning Laws
As we saw on the previous slide, the output with the current weights is wrong if we want to perform AND.
This brings us to the problem of finding the correct set of weights
The process of identifying the correct set of weights is called the learning process, and it is characterised by a learning law

35 Learning Laws 2
The purpose of a learning law is to locate the set of weights which will give correct answers for all the inputs
The learning is achieved by employing an algorithm which iteratively changes the weights of the connections in response to every set of inputs until the correct weights have been located

36 Learning Laws 3
Most learning laws are based on Hebb's rule, which states that if two units are simultaneously active, the strength of the connection between them should be increased
This rule is the basis for most learning laws used today (Kohonen learning, Boltzmann learning, Delta rule)

37 Some Learning Rules
Hebbian learning rule: $\Delta w_{ij} = c\, f(w_i^t x)\, x_j$
Perceptron learning rule: $\Delta w_{ij} = c\, [d_i - \mathrm{sgn}(w_i^t x)]\, x_j$
Delta learning rule: $\Delta w_{ij} = c\, (d_i - o_i)\, f'(net_i)\, x_j$
Widrow-Hoff learning rule: $\Delta w_{ij} = c\, (d_i - w_i^t x)\, x_j$

38 Learning Methods
Supervised approach: a neural network is given a set of inputs and also the correct output
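The four rules listed under Some Learning Rules share one form: the weight change is the learning constant c times a learning signal times the input x_j, and only the learning signal differs. The sketch below makes that explicit. The function names are illustrative assumptions; `d` is the desired output, and `f` and `f_prime` are the activation function and its derivative supplied by the caller.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgn(net):
    return 1 if net > 0 else -1

def hebbian_delta(w, x, c, f):
    """Hebbian: the learning signal is the node's own output f(w.x)."""
    r = f(dot(w, x))
    return [c * r * xj for xj in x]

def perceptron_delta(w, x, d, c):
    """Perceptron: the learning signal is d - sgn(w.x), zero when correct."""
    r = d - sgn(dot(w, x))
    return [c * r * xj for xj in x]

def delta_rule_delta(w, x, d, c, f, f_prime):
    """Delta rule: the error (d - o) scaled by the slope f'(net)."""
    net = dot(w, x)
    r = (d - f(net)) * f_prime(net)
    return [c * r * xj for xj in x]

def widrow_hoff_delta(w, x, d, c):
    """Widrow-Hoff: the error is measured on the net input itself."""
    r = d - dot(w, x)
    return [c * r * xj for xj in x]

w, x = [1.0, -1.0], [2.0, 1.0]          # hypothetical weights and input
print(hebbian_delta(w, x, 0.1, sgn))
print(perceptron_delta(w, x, -1, 0.1))
print(widrow_hoff_delta(w, x, 0.5, 0.1))
```

With a linear activation (f(net) = net, f'(net) = 1) the delta rule reduces to the Widrow-Hoff rule, which the code reproduces.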

39 Learning Methods 2
Unsupervised approach: a neural network is given a set of inputs and no outputs. The network attempts to generate its own classes

40 Learning Methods 3
Reinforcement approach: a neural network is given a set of inputs and no outputs. The network generates an output and only then is it told whether the produced output was correct or not
Learn by doing

41 Single-Layer Perceptrons
Network architecture
[Figure: inputs x1, x2, x3 with weights w1, w2, w3 and bias weight w0; output y = signum(net) or y = step(net)]
$net = \sum_{i=1}^{n} x_i w_i - u = \sum_{i=1}^{n} x_i w_i + w_0$, where $w_0 = -u$
$net = \sum_{i=0}^{n} x_i w_i$, where $i$ now starts at 0 and $x_0 = 1$
Signum(net) = 1 if net > 0 else -1
Step(net) = 1 if net > 0 else 0

42 Example I - The AND Function
[Figure: perceptron with inputs X1, X2, weights W1, W2, W0 and output O; values shown: 1, 1, 2]
1,1 ---> 1
rest ---> 0

43 Single-Layer Perceptrons
If the response is correct, no modification takes place; otherwise the weights are changed by $\Delta w_{ij} = c\, [d_i - \mathrm{sgn}(w_i^t x)]\, x_j$
An entire pass through all of the input training vectors is called an epoch. When such an entire pass of the training set has occurred without error, training is complete.

44 Limitations
Perceptron networks have several limitations.
First, the output of a perceptron can take on only one of two values (True or False).
Second, perceptrons can only classify linearly separable sets of vectors. If a straight line or plane can be drawn to separate the input vectors into their correct categories, the input vectors are linearly separable and the perceptron will find the solution. If the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly. The most famous example is the boolean XOR problem.
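The training procedure for a single-layer perceptron (update only on a wrong response, stop after an error-free epoch) can be sketched on the AND function. This is a minimal illustration under stated assumptions: targets are coded as -1 and +1 so that they match the sgn output, the learning constant is c = 0.5, weights start at zero, and `max_epochs` is a hypothetical safety limit.

```python
def sgn(net):
    return 1 if net > 0 else -1

def train_perceptron(samples, c=0.5, max_epochs=100):
    """Perceptron learning rule: w <- w + c * (d - sgn(w.x)) * x.

    samples: list of (inputs, desired) with desired in {-1, +1}.
    Returns (weights, epochs_used); weights[0] is the bias weight w0.
    epochs_used is None if no error-free epoch occurred within max_epochs.
    """
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, d in samples:
            xa = [1.0] + list(x)                      # x0 = 1 carries the bias
            o = sgn(sum(wi * xi for wi, xi in zip(w, xa)))
            if o != d:                                # update only on a wrong response
                errors += 1
                w = [wi + c * (d - o) * xi for wi, xi in zip(w, xa)]
        if errors == 0:                               # an entire epoch without error
            return w, epoch
    return w, None

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, epochs = train_perceptron(AND)
print(weights, epochs)
```

Because AND is linearly separable, the loop ends with an error-free epoch; for a set that is not linearly separable the function returns `None` for the epoch count.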

45 The XOR problem
In the 1960s perceptrons created a great deal of interest, until M. Minsky and S. Papert (Perceptrons, MIT Press, Cambridge, MA, 1969) showed that single-layer perceptrons can only be used for toy problems, since they cannot represent a simple XOR function

46 The XOR problem 2
The task is to assign a binary input vector to class 0 if the vector has an even number of 1s, and to class 1 otherwise.
A two-input binary XOR truth table:
x1 | x2 | XOR
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

47 The XOR problem 3
Recall that the output of a perceptron is 1 if the weighted input is greater than 0, and 0 otherwise
The first input of XOR is 0 0 with desired output 0, hence the weighted input must be less than or equal to zero:
$0 \cdot w_1 + 0 \cdot w_2 + 1 \cdot w_0 \le 0 \Rightarrow w_0 \le 0$

48 The XOR problem 4
The second input of XOR is 0 1 with desired output 1, hence the weighted input must be greater than zero:
$0 \cdot w_1 + 1 \cdot w_2 + 1 \cdot w_0 > 0 \Rightarrow w_2 + w_0 > 0$

49 The XOR problem 5
The third input of XOR is 1 0 with desired output 1, hence the weighted input must be greater than zero:
$1 \cdot w_1 + 0 \cdot w_2 + 1 \cdot w_0 > 0 \Rightarrow w_1 + w_0 > 0$

50 The XOR problem 6
The fourth input of XOR is 1 1 with desired output 0, hence the weighted input must be less than or equal to zero:
$1 \cdot w_1 + 1 \cdot w_2 + 1 \cdot w_0 \le 0 \Rightarrow w_1 + w_2 + w_0 \le 0$

51 The XOR problem 7
In summary, the perceptron must satisfy the following four inequalities:
$w_0 \le 0$
$w_2 + w_0 > 0$
$w_1 + w_0 > 0$
$w_1 + w_2 + w_0 \le 0$
The first inequality tells us that $w_0$ must be less than or equal to zero. For the 2nd and 3rd to hold, $w_2$ and $w_1$ must therefore both be positive. Adding the 2nd and 3rd gives $w_1 + w_2 + 2 w_0 > 0$, so $w_1 + w_2 + w_0 > -w_0 \ge 0$, which contradicts the 4th.

52 Linear Separability
For binary inputs and outputs using the step function, the output is 1 if the net input is positive and 0 if the net input is negative
net_input = 0: for two inputs this equation represents a line (the decision line)
If there are weights such that all of the training input vectors for which the correct response is 1 lie on one side of the decision line and all of the training input vectors for which the correct response is 0 lie on the other side, then the problem is linearly separable

53 Linear Separability

54 The XOR problem 8
The XOR problem is not linearly separable
We cannot use a single-layer perceptron to construct a straight line that partitions the two-dimensional input space into two regions, each containing only data points of the same class
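The claim that no single decision line separates XOR can be checked by brute force over a grid of candidate weights. This is only an illustration, not a proof (the proof is the contradiction among the four inequalities); the grid range and spacing are arbitrary assumptions.

```python
from itertools import product

def step(net):
    return 1 if net > 0 else 0

def separating_weights(truth_table, grid):
    """Return every (w0, w1, w2) on the grid whose step perceptron reproduces the table."""
    found = []
    for w0, w1, w2 in product(grid, repeat=3):
        if all(step(w0 + w1 * x1 + w2 * x2) == t for (x1, x2), t in truth_table):
            found.append((w0, w1, w2))
    return found

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
grid = [i / 4 for i in range(-8, 9)]            # -2.0 to 2.0 in steps of 0.25

print(len(separating_weights(AND, grid)) > 0)   # True: AND is linearly separable
print(len(separating_weights(XOR, grid)))       # 0: no weight triple works for XOR
```

For AND, one of the solutions on the grid is w0 = -1.5, w1 = 1, w2 = 1, which places the decision line between the point (1, 1) and the other three points.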
[Figure: the XOR input points plotted in the X-Y plane]

55 Multi-Layer Perceptrons
The lack of suitable training methods for multi-layer perceptrons (MLPs) led to a waning of interest until the reformulation of the backpropagation training method
Previous work used signum or step activation functions, which are nondifferentiable; now continuous activation functions are employed

56 Multi-Layer Perceptrons 2
All nodes (or neurons) perform the same function on incoming signals: a composite of the weighted sum and a differentiable nonlinear activation function, together known as the transfer function

57 Multi Layer Feedforward Networks
The layers that are neither input nor output are called hidden layers
Hidden layers extract high-order statistics and in a way provide an overall view of the input data
The output of each layer is used as input to the next layer
There is no theoretical limit on connections between non-neighboring layers

58 MLP Architecture 2-2-1
[Figure: input level (x1, x2), intermediate (hidden) level (h1, h2), output level (y)]

59 Activation Functions
Logistic function: $f(net) = 1 / (1 + e^{-net})$
Hyperbolic tangent function: $f(net) = \tanh(net/2) = (1 - e^{-net}) / (1 + e^{-net}) = 2 / (1 + e^{-net}) - 1 = (e^{net/2} - e^{-net/2}) / (e^{net/2} + e^{-net/2})$
Identity function: $f(net) = net$
where net is the weighted input

60 Activation Functions 2
The logistic and hyperbolic tangent functions approximate the step and signum functions, respectively, but they provide smooth, non-zero derivatives with respect to the input signals
They are referred to as squashing functions, since the inputs to these functions are squashed to the range [0,1] or [-1,1]
They are referred to as sigmoidal functions because of their S-shaped curves
The hyperbolic tangent is sometimes referred to as the bipolar sigmoidal
The logistic is sometimes referred to as the binary sigmoidal

61 Activation Functions Graphs
[Figure panels: The Logistic Function; The Hyperbolic Function]

62 Identity Activation Function
Identity function: it is usually employed for nodes of the output layer to approximate a continuous-valued function not limited to [0,1] or [-1,1]; such nodes are referred to as linear nodes
[Figure: The Identity Function]

63 Binary and Bipolar Sigmoid Derivatives
Binary sigmoid: $f(net) = 1 / (1 + e^{-net})$, with $f'(net) = f(net)\,[1 - f(net)]$
Bipolar sigmoid: $f(net) = 2 / (1 + e^{-net}) - 1$, with $f'(net) = 0.5\,[1 + f(net)]\,[1 - f(net)]$

64 Learning
Learning target: minimize the difference between actual outputs and target outputs
Learning rule:
  Steepest descent (Back-propagation)
  Conjugate gradient method
  All optimization methods using first derivative
  Derivative-free optimization
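Steepest descent with back-propagation can be sketched for a 2-2-1 MLP with logistic nodes, using the identity f'(net) = f(net)[1 - f(net)] so that no derivative has to be computed separately. This is a minimal sketch under stated assumptions: the error is E = 0.5 (d - y)^2 per pattern, weights are updated after every pattern, and the learning rate, random seed, epoch count and all names are illustrative choices.

```python
import math
import random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(params, x):
    """2-2-1 MLP. params = (W_h, b_h, w_o, b_o); returns (hidden outputs, output)."""
    W_h, b_h, w_o, b_o = params
    h = [logistic(sum(W_h[i][k] * x[k] for k in range(2)) + b_h[i]) for i in range(2)]
    y = logistic(sum(w_o[i] * h[i] for i in range(2)) + b_o)
    return h, y

def backprop_step(params, x, d, lr):
    """One steepest-descent update on E = 0.5 * (d - y)^2 for a single pattern."""
    W_h, b_h, w_o, b_o = params
    h, y = forward(params, x)
    delta_o = (d - y) * y * (1.0 - y)                       # output error term
    delta_h = [delta_o * w_o[i] * h[i] * (1.0 - h[i]) for i in range(2)]  # propagated back
    new_w_o = [w_o[i] + lr * delta_o * h[i] for i in range(2)]
    new_b_o = b_o + lr * delta_o
    new_W_h = [[W_h[i][k] + lr * delta_h[i] * x[k] for k in range(2)] for i in range(2)]
    new_b_h = [b_h[i] + lr * delta_h[i] for i in range(2)]
    return (new_W_h, new_b_h, new_w_o, new_b_o)

def total_error(params, data):
    return sum(0.5 * (d - forward(params, x)[1]) ** 2 for x, d in data)

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
rng = random.Random(0)
initial_params = ([[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
                  [rng.uniform(-1, 1) for _ in range(2)],
                  [rng.uniform(-1, 1) for _ in range(2)],
                  rng.uniform(-1, 1))
params = initial_params
before = total_error(params, XOR)
for epoch in range(2000):
    for x, d in XOR:
        params = backprop_step(params, x, d, lr=0.5)
print(before, total_error(params, XOR))
```

Whether XOR is actually learned depends on the initial weights, since steepest descent can stall in a local minimum; the sketch only prints the total error before and after training.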

65 MLP and the backpropagation algorithm

68 MLP and the backpropagation algorithm
[Figure: Input Layer ($x_k$), Hidden Layer ($h_i$), Output Layer ($y_j$), desired output ($o_j$), weights $w_{ki}$ and $w_{ij}$, Signal and Error paths]

69 Backpropagation Algorithm
Step 0: Initialise weights
Step 1: While the stopping condition is false, do steps 2 to 9

70 Backpropagation Algorithm 2
Step 2: For each training pair, do steps 3 to 8
Feedforward pass
Step 3: Each input unit receives an input signal and broadcasts this signal to all units in the layer above (the hidden units)
Step 4: Each hidden unit sums its weighted input signals, applies its activation function to compute its output signal and sends this signal to all units in the layer above (the output units)
Step 5: Each output unit sums its weighted input signals and applies its activation function to compute its output signal
End of feedforward pass

71 Backpropagation Algorithm 3
Backward pass
Step 6: Each output unit receives a target pattern corresponding to the input training pattern, computes its error information term, calculates its weight and bias correction terms, and sends its error information term to the units in the layer below
Step 7: Each hidden unit sums its error information terms (from the units in the layer above), multiplies by the derivative of its activation function to calculate its error information term, and calculates its weight and bias correction terms
End of backward pass

72 Backpropagation Algorithm 4
Updating pass
Step 8: Each output unit updates its bias and weights. Each hidden unit updates its bias and weights.
End of updating pass
Step 9: Test stopping criterion

73 Backpropagation Algorithm 5

74 Problems
How to determine the architecture?
How to determine the parameters?
How to get global optima?
...

75 GA and ANN
Three levels:
connection weights: introduce an adaptive and global approach to training
architectures: adapt the topologies to different tasks without human intervention and thus provide an approach to automatic ANN design, as both ANN connection weights and structures can be evolved
learning rules: learning to learn, an adaptive process of automatic discovery of novel learning rules

76 Evolution of connection weights
Weight training in ANNs is usually formulated as minimization of an error function, such as the mean square error between target and actual outputs averaged over all examples, by iteratively adjusting connection weights.
BP often gets trapped in a local minimum of the error function and is incapable of finding a global minimum if the error function is multimodal and/or nondifferentiable.
GA can be used effectively in the evolution to find a near-optimal set of connection weights globally without computing gradient information.

77 Typical cycle of the evolution of the connection weights
Step 1: Decode each individual in the current generation into a set of connection weights and construct a corresponding ANN with the weights
Step 2: Evaluate each ANN by computing its total mean square error between actual and target outputs. The fitness of an individual is determined by the error. A regularization term may be included in the fitness function to penalize large weights.
Step 3: Select parents for reproduction based on their fitness
Step 4: Apply genetic operators, such as crossover and mutation, to parents to generate offspring, which form the next generation

78 Representation
Binary or real number
Put connection weights to the same node together. Nodes in an ANN are in essence feature extractors and detectors.
Separating inputs to the same node far apart would increase the difficulty of constructing useful feature detectors, because they might be destroyed by crossover operators.
Permutation problem: the mapping from the representation to the actual ANN is many-to-one, since two ANNs that order their hidden nodes differently in their chromosomes will still be functionally equivalent. This makes the crossover operator very inefficient in producing good offspring.

80 Comparison between GA and BP
GA can handle the global search problem better.
It can be used to train many different networks regardless of their architecture and saves a lot of human effort in developing different training algorithms for different types of ANN.
GA makes it easier to generate ANNs with some special characteristics.
GA is much less sensitive to the initial conditions of training.
There is no clear winner in terms of the best training algorithm.

81 Hybrid training
Combine GA's global search ability with local search's ability to fine-tune.
GA can be used to locate a good region in the space, and then a local search procedure is used to find a near-optimal solution in this region.

82 The evolution of architecture
The architecture of an ANN includes its topological structure, i.e., connectivity, and the transfer function of each node in the ANN.
The architecture has a significant impact on a network's information processing capabilities.
Given a learning task, an ANN with only a few connections and linear nodes may not be able to perform the task at all due to its limited capability, while an ANN with a large number of connections and nonlinear nodes may overfit noise in the training data and fail to generalize well.

83 Traditional way to design the architecture
There is no systematic way to design a near-optimal architecture for a given task automatically.
A constructive algorithm starts with a minimal network (a network with a minimal number of hidden layers, nodes and connections) and adds new layers, nodes and connections when necessary during training.
A destructive algorithm starts with a maximal network (a network with a maximal number of hidden layers, nodes and connections) and deletes unnecessary layers, nodes and connections during training.
Such structural hill-climbing methods are susceptible to becoming trapped at structural local optima. They only investigate restricted topological subsets rather than the complete class of network architectures.

84 Typical cycle of the evolution of architecture
Step 1: Decode each individual in the current generation into an architecture.
Step 2: Train each ANN with the decoded architecture by a predefined learning rule, starting from different sets of random initial connection weights and learning rule parameters.
Step 3: Compute the fitness of each individual according to the above training result and other performance criteria such as the complexity of the architecture.
Step 4: Select parents from the population based on their fitness.
Step 5: Apply search operators to the parents and generate offspring, which form the next generation.

85 The direct encoding scheme
An $N \times N$ matrix $C = (c(i,j))$ can represent an ANN architecture with $N$ nodes, where $c(i,j)$ indicates the presence or absence of the connection from node $i$ to node $j$.
Such an encoding scheme can handle both feedforward and recurrent ANNs.

86 A feedforward ANN

87 A recurrent ANN

88 Notes about direct encoding scheme
It is straightforward to implement.
Training error, training time and complexity can be used in the fitness function
A large ANN would require a very large matrix and thus increase the computation time of the evolution.
Domain knowledge can be used to reduce the search space
The permutation problem still exists

89 The indirect encoding scheme
Only some characteristics of an architecture are encoded, to reduce the length of the chromosome.
The details about each connection in an ANN are either predefined according to prior knowledge or specified by a set of deterministic development rules.

90 Parametric representation
ANN architectures may be specified by a set of parameters such as the number of hidden layers, the number of hidden nodes in each layer, the number of connections between two layers, etc.
In general, the parametric representation method will be most suitable when we know what kind of architectures we are trying to find.

91 Example of pattern recognition
Input | Output | Input | Output
0000 | 00 | 0100 | 00
1100 | 00 | 1000 | 00
1001 | 01 | 0001 | 01
1101 | 01 | 0101 | 01
0010 | 11 | 1010 | 11
0110 | 11 | 1110 | 11
0011 | 10 | 0111 | 10
1011 | 10 | 1111 | 10
In fact the first two bits of the input are noise and the output is the Gray code of the last two bits of the input.

92 Chromosome
We use a 16-bit chromosome
The first 2 bits stand for the learning rate: 0.5, 0.25, 0.125, 0.0625
The next 2 bits stand for the momentum: 0.9, 0.8, 0.7, 0.6
The next 2 bits stand for the range of the initial weights: 1, 0.5, 0.25, 0.125
The next 5 bits are used for the 1st hidden layer: the first bit indicates whether the hidden layer exists and the other 4 bits stand for the number of hidden units.
The last 5 bits are used for the 2nd hidden layer: the first bit indicates whether the hidden layer exists and the other 4 bits stand for the number of hidden units.

93 Evolution and result
Only the first 8 samples are used for evolution.
7 of these 8 samples are used for training the ANN and the remaining one is used to get the fitness.
Finally we get a 4-1-4-2 ANN (structure and weights).
In order to check the final result we use the other 8 samples and compare with a 4-16-16-2 ANN which is trained by BP.

94 Developmental rule representation
Development rules, which are used to construct architectures, are encoded in chromosomes.
A development rule is usually described by a recursive equation or a production system.
How to get such a set of rules to construct an ANN? One answer is to evolve them.
We can encode the whole rule set as an individual (Pittsburgh approach) or encode each rule as an individual (Michigan approach)

95 Examples of some development rules

96 Development of an ANN architecture

97 Simultaneous evolution of architectures & weights

98 Evolution of learning rules
An ANN training algorithm may have different performance when applied to different architectures.
The design of training rules, and more fundamentally of the learning rules used to adjust weights, depends on the type of architecture under investigation.
Different variants of the Hebbian learning rule have been proposed to deal with different architectures.
It is desirable to develop an automatic and systematic way to adapt the learning rule to an architecture and the task to be performed.
Designing a learning rule manually often implies that some assumptions, which are not necessarily true in practice, have to be made.

99 Typical cycle of the evolution of learning rule
Step 1: Decode each individual in the current generation into a learning rule
Step 2: Construct a set of ANNs with randomly generated architectures and initial connection weights, and train them using the decoded learning rule.
Step 3: Calculate the fitness of each individual according to the average training result
Step 4: Select parents from the current generation according to their fitness
Step 5: Apply search operators to parents to generate offspring, which form the new generation

100 Evolution of algorithm parameters
The adaptive adjustment of BP's parameters through evolution could be considered the first attempt at the evolution of learning rules.
Some researchers used a GA process to find parameters for BP, but the ANN's architecture was predefined. The parameters evolved in this case tend to be optimized towards the architecture rather than being generally applicable to learning.
Some researchers encoded BP's parameters in chromosomes together with the ANN's architecture.
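The 16-bit chromosome of the pattern recognition example is one such joint encoding of BP parameters and architecture, and decoding it can be sketched as follows. Several details are assumptions, because the slides do not state them: bits are read most significant first, a 2-bit field indexes its value list in the order given, and the 4-bit unit count v is read as v + 1 units (1 to 16). The example chromosome is hypothetical.

```python
LEARNING_RATES = [0.5, 0.25, 0.125, 0.0625]   # called "study ratio" on the slide
MOMENTA = [0.9, 0.8, 0.7, 0.6]
WEIGHT_RANGES = [1.0, 0.5, 0.25, 0.125]

def bits_to_int(bits):
    return int(bits, 2)

def decode_layer(bits5):
    """First bit: is the layer present? Remaining 4 bits: number of hidden units.

    Assumption: the 4-bit value v is read as v + 1 units (1 to 16).
    """
    if bits5[0] == "0":
        return None
    return bits_to_int(bits5[1:]) + 1

def decode(chromosome):
    """Decode a 16-character bit string into BP parameters and hidden layer sizes."""
    if len(chromosome) != 16 or set(chromosome) - {"0", "1"}:
        raise ValueError("expected a string of 16 bits")
    layers = (decode_layer(chromosome[6:11]), decode_layer(chromosome[11:16]))
    return {
        "learning_rate": LEARNING_RATES[bits_to_int(chromosome[0:2])],
        "momentum": MOMENTA[bits_to_int(chromosome[2:4])],
        "weight_range": WEIGHT_RANGES[bits_to_int(chromosome[4:6])],
        "hidden_layers": [n for n in layers if n is not None],
    }

# Hypothetical chromosome: fields 00 | 01 | 10 | 10000 | 10011
print(decode("0001101000010011"))
```

Under these assumptions the example decodes to hidden layers of 1 and 4 units, i.e. a 4-1-4-2 network once the 4 inputs and 2 outputs of the task are added.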

101 Evolution of learning rules
The evolution of learning rules has to work on the dynamic behavior of an ANN.
Trying to develop a universal representation scheme which can specify any kind of dynamic behavior is clearly impractical.
Two basic assumptions which have often been made about learning rules are:
1) weight updating depends only on local information such as the activation of the input node, the activation of the output node, the current connection weight, etc.;
2) the learning rule is the same for all connections in an ANN

102 Learning rule
A learning rule can be described by the following function
There are three major issues involved in the evolution of learning rules: 1) determination of a subset of terms described in the above equation; 2) representation of the coefficients as chromosomes; and 3) the GA used to evolve these chromosomes.

103 Other combinations of GA and ANN
Evolution of input features: finding a near-optimal set of input features for an ANN
ANN as fitness estimator: the time-consuming fitness evaluation based on real systems is replaced by fast fitness evaluation based on an ANN
Evolving ANN ensembles: combining different individuals in the population to form an integrated system is expected to produce better results.

104 A general framework for GA and ANN
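The evolutionary cycles listed in these slides (decode, evaluate, select, apply genetic operators) share one loop. The sketch below applies it to the evolution of connection weights for a single logistic node learning AND. The design choices are assumptions for illustration only: a real-number representation (the chromosome is the weight vector itself, so decoding is trivial), truncation selection, one-point crossover, Gaussian mutation of one gene, and elitism; no regularization term is included.

```python
import math
import random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def mse(weights, data):
    """Mean square error of a single logistic node; weights[0] is the bias."""
    total = 0.0
    for x, d in data:
        y = logistic(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))
        total += (d - y) ** 2
    return total / len(data)

def evolve(data, n_weights, pop_size=20, generations=50, seed=0):
    """Evolve weight vectors; returns (best individual, best error per generation)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_weights)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda ind: mse(ind, data))      # evaluate: lower error is fitter
        history.append(mse(pop[0], data))
        parents = pop[: pop_size // 2]                # truncation selection
        children = [pop[0][:]]                        # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_weights)         # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_weights)              # Gaussian mutation of one gene
            child[i] += rng.gauss(0.0, 0.5)
            children.append(child)
        pop = children
    pop.sort(key=lambda ind: mse(ind, data))
    history.append(mse(pop[0], data))
    return pop[0], history

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
best, history = evolve(AND, n_weights=3)
print(history[0], history[-1])
```

Because the best individual is copied unchanged into the next generation, the best error recorded in `history` can never increase from one generation to the next; no gradient information is used anywhere in the loop.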