Artificial Neural Networks for Data Mining
Amrender Kumar
I.A.S.R.I., Library Avenue, Pusa, New Delhi-110 012
akjha@iasri.res.in
1. Introduction
Neural networks, more accurately called Artificial Neural Networks (ANNs), are computational
models that consist of a number of simple processing units that communicate by sending signals
to one another over a large number of weighted connections. They were originally developed
from the inspiration of human brains. In human brains, a biological neuron collects signals from
other neurons through a host of fine structures called dendrites. The neuron sends out spikes of
electrical activity through a long, thin strand known as an axon, which splits into thousands of
branches. At the end of each branch, a structure called a synapse converts the activity from the
axon into electrical effects that inhibit or excite activity in the connected neurons. When a neuron
receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a
spike of electrical activity down its axon. Learning occurs by changing the effectiveness of the
synapses so that the influence of one neuron on another changes. Like human brains, neural
networks also consist of processing units (artificial neurons) and connections (weights) between
them. The processing units transport incoming information on their outgoing connections to
other units. The "electrical" information is simulated with specific values stored in those weights
that make these networks have the capacity to learn, memorize, and create relationships amongst
data. A very important feature of these networks is their adaptive nature where "learning by
example" replaces "programming" in solving problems. This feature makes such computational
models very appealing in application domains where one has little or incomplete understanding
of the problem to be solved but where training data is readily available. These networks are
“neural” in the sense that they may have been inspired by neuroscience but not necessarily
because they are faithful models of biological neural or cognitive phenomena. ANNs have
powerful pattern classification and pattern recognition capabilities through learning and can
generalize from experience. ANNs are a non-linear, data-driven, self-adaptive approach, as
opposed to traditional model-based methods. They are powerful tools for modelling, especially when
the underlying data relationship is unknown. ANNs can identify and learn correlated patterns
between input data sets and corresponding target values. After training, ANNs can be used to
predict the outcome of new independent input data. ANNs imitate the learning process of the
human brain and can process problems involving non-linear and complex data even if the data
are imprecise and noisy. These techniques are being successfully applied across an extraordinary
range of problem domains, in areas as diverse as finance, medicine, engineering, geology,
physics, biology and agriculture. There are many different types of neural networks. Some of the
most traditional applications include classification, noise reduction and prediction.
2. Review
The genesis of ANN modelling and its applications appears to be a recent development. However, this
field was established before the advent of computers. It started with McCulloch and Pitts (1943),
who modelled the functions of the human brain and proposed a model of a "computing element",
called the McCulloch-Pitts neuron, which performs a weighted sum of the inputs to the element
followed by a threshold logic operation. Combinations of these computing elements were used to
realize several logical computations. The main drawback of this model of computation is that the
weights are fixed and hence the model could not learn from examples. Hebb (1949) proposed a
learning scheme for adjusting a connection weight based on pre- and post-synaptic values of the
variables. Hebb's law became a fundamental learning rule in the neural-network literature.
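The McCulloch-Pitts computing element described above is simple enough to sketch directly. The weights and thresholds below are illustrative values chosen for this sketch (not taken from the original paper) to realize the logical AND and OR functions:

```python
# McCulloch-Pitts neuron: a weighted sum of the inputs followed by
# threshold logic. The weights are fixed, so the unit cannot learn,
# which is the drawback noted above.

def mp_neuron(inputs, weights, threshold):
    """Fire (output 1) if the weighted sum of inputs reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# With unit weights, a threshold of 2 realizes logical AND of two
# binary inputs, and a threshold of 1 realizes logical OR.
AND = lambda a, b: mp_neuron([a, b], [1, 1], threshold=2)
OR = lambda a, b: mp_neuron([a, b], [1, 1], threshold=1)
```

Combining such units, as the text notes, realizes more elaborate logical computations.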
Rosenblatt (1958) proposed the perceptron model, whose weights are adjustable by the
perceptron learning law. Widrow and Hoff (1960) proposed the ADALINE (Adaptive Linear
Element) model for computing elements and the LMS (Least Mean Square) learning algorithm to
adjust the weights of an ADALINE model. Hopfield (1982) gave an energy analysis of feedback
neural networks. The analysis showed the existence of stable equilibrium states in a feedback
network, provided the network has symmetrical weights. Rumelhart et al. (1986) showed that it
is possible to adjust the weights of a multilayer feedforward neural network in a systematic way
to learn the implicit mapping in a set of input-output pattern pairs. The learning law is called the
generalized delta rule or error backpropagation. Cheng and Titterington (1994) made a detailed
study of ANN models vis-a-vis traditional statistical models. They showed that some
statistical procedures, including regression, principal component analysis, density estimation and
statistical image analysis, can be given neural-network expressions. Warner and Misra (1996)
reviewed the relevant literature on neural networks, explained the learning algorithm and made a
comparison between regression and neural network models in terms of notations, terminologies
and implementation. Kaastra and Boyd (1996) developed a neural network model for forecasting
financial and economic time series. Dewolf and Francl (1997, 2000) demonstrated the
applicability of neural network technology to plant disease forecasting. Zhang et al. (1998)
provided a general summary of the work on ANN forecasting, with guidelines for
neural network modelling and the general paradigm of ANNs, especially those used for forecasting.
They reviewed the relative performance of ANNs against traditional statistical methods;
in most of the studies, ANNs were found to perform better. Sanzogni and Kerr
(2001) developed models for predicting milk production from farm inputs using a standard feed
forward ANN. Chakraborty et al. (2004) utilized the ANN technique for predicting the severity of
anthracnose disease in a legume crop. Gaudart et al. (2004) compared the performance of MLP
and that of linear regression for epidemiological data with regard to quality of prediction and
robustness to deviation from underlying assumptions of normality, homoscedasticity and
independence of errors; MLP was found to perform better than linear regression. More
general books on neural networks are available; to cite a few: Hassoun (1995), Patterson (1996),
Schalkoff (1997), Yegnanarayana (1999) and Anderson (2003). Software on neural networks has
also been developed, e.g. Statistica and Matlab. Commercial software: Statistica Neural
Network, TNs2Server, DataEngine, Know Man Basic Suite, Partek, Saxon, ECANSE
(Environment for Computer Aided Neural Software Engineering), NeuroShell, Neurogen,
Matlab Neural Network Toolbox, Tarjan, FCM (Fuzzy Control Manager), etc. Freeware software:
NetII, Spider Nets Neural Network Library, NeuDC, Binary Hopfield Net with free Java source,
Neural Shell, PlaNet, Valentino Computational Neuroscience Workbench, Neural Simulation
Language (NSL), etc.
problem before they are tested for their 'inference' capability on unknown instances of
the problem. They can, therefore, identify objects on which they have not previously been trained.
- Possess the capability to generalize. Thus, they can predict new outcomes from past
trends.
- Are robust and fault-tolerant systems. They can, therefore, recall full patterns from
incomplete, partial or noisy patterns.
Given enough data, enough hidden units, and enough training time, an MLP with just one hidden
layer can learn to approximate virtually any function to any degree of accuracy. (A statistical
analogy is approximating a function with nth order polynomials.) For this reason MLPs are
known as universal approximators and can be used when you have little prior knowledge of the
relationship between inputs and targets. Although one hidden layer is always sufficient provided
you have enough data, there are situations where a network with two or more hidden layers may
require fewer hidden units and weights than a network with one hidden layer, so using extra
hidden layers sometimes can improve generalization.
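The polynomial analogy mentioned above can be illustrated numerically. The target function sin(x) and the fitted orders below are arbitrary choices for this sketch:

```python
# As the order of a fitted polynomial grows, it approximates a smooth
# target function increasingly well, loosely mirroring how adding hidden
# units lets an MLP approximate a mapping more closely.
import numpy as np

x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(x)  # arbitrary smooth target for the illustration

def max_fit_error(order):
    """Worst-case gap between the target and its least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, order)
    return float(np.max(np.abs(np.polyval(coeffs, x) - y)))

# Raising the order shrinks the worst-case error on the interval.
errors = [max_fit_error(k) for k in (1, 3, 5, 7)]
```

The caveat in the text applies equally here: a higher-order fit needs more data to be estimated reliably, just as a larger MLP does.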
7. Learning of ANNs
The most significant property of a neural network is that it can learn from its environment and can
improve its performance through learning. Learning is a process by which the free parameters of
a neural network, i.e. synaptic weights and thresholds, are adapted through a continuous process
of stimulation by the environment in which the network is embedded. The network becomes
more knowledgeable about its environment after each iteration of the learning process. There are three
types of learning paradigms namely, supervised learning, reinforced learning and self-organized
or unsupervised learning.
Supervised learning incorporates an external teacher, so that each output unit is told what its desired response to input
signals ought to be. During the learning process global information may be required. An
important issue concerning supervised learning is the problem of error convergence, i.e. the
minimization of error between the desired and computed unit values. The aim is to determine a
set of weights which minimizes the error.
(b) Number of hidden neurons: a rule of thumb was proposed by Masters (1993): for a three
layer network with n input and m output neurons, the hidden layer would have sqrt(n*m) neurons.
(c) Number of output nodes: Neural networks with multiple outputs, especially if these
outputs are widely spaced, will produce inferior results as compared to a network with a
single output.
(d) Activation function: Activation functions are mathematical formulae that determine the
output of a processing node. Most units in a neural network transform their net inputs by
using a scalar-to-scalar function called an activation function, yielding a value called
the unit's activation. Except possibly for output units, the activation value is fed to one
or more other units. Activation functions with a bounded range are often called
'squashing functions'. An appropriate differentiable function is used as the activation
function. Some of the most commonly used activation functions are:
- The sigmoid (logistic) function: f(x) = 1 / (1 + exp(-x))
- The hyperbolic tangent (tanh) function: f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
- The sine or cosine function: f(x) = sin(x) or f(x) = cos(x)
Activation functions for the hidden units are needed to introduce non-linearity into the networks.
The reason is that a composition of linear functions is again a linear function. However, it is the
non-linearity (i.e. the capability to represent nonlinear functions) that makes multilayer networks
so powerful. Almost any nonlinear function does the job, although for back-propagation learning
it must be differentiable and it helps if the function is bounded. Therefore, the sigmoid functions
are the most common choices. There are some heuristic rules for the selection of the activation
function. For example, Klimasauskas (1991) suggests logistic activation functions for
classification problems that involve learning about average behaviour, and hyperbolic
tangent functions if the problem involves learning about deviations from the average,
such as forecasting problems.
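As a minimal sketch, the listed activation functions can be written directly; the derivative shown for the sigmoid is the standard one that back-propagation relies on:

```python
# Common activation ("squashing") functions used in hidden units.
import math

def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + exp(-x)); bounded range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent (exp(x) - exp(-x)) / (exp(x) + exp(-x)); range (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigmoid_deriv(x):
    """f'(x) = f(x) * (1 - f(x)); differentiability is what back-propagation needs."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Both functions are bounded and differentiable everywhere, which is why they dominate in practice.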
sufficient provided we have enough data. A schematic representation of a neural network is given in
Fig. 5.
Fig. 5: Schematic representation of a neural network: inputs are fed to the ANN model, the
output vector is compared with the target vector, and the differences are used to adjust the weights.
variables, the weights and bias. If we also define a differentiable error function of the network
outputs, such as the sum-of-squares error function, then the error function itself is a differentiable
function of the weights. Therefore, we can evaluate the derivatives of the error with respect to the
weights, and these derivatives can then be used to find the weights that minimize the error
function using an optimization method. The algorithm for evaluating the derivatives of the
error function is known as backpropagation, because it propagates the errors backward through
the network. The multilayer feedforward neural network, or multilayer perceptron (MLP), is very
popular and is used more than other neural network types for a wide variety of tasks. An MLP trained
by the backpropagation algorithm follows a supervised procedure, i.e. the network constructs a
model based on examples of data with known outputs. The backpropagation learning algorithm
is based on an error-correction learning rule, specifically on the minimization of the mean
squared error, which is a measure of the difference between the actual and the desired output. Like all
multilayer feedforward networks, multilayer perceptrons are constructed of at least three
layers (one input layer, one or more hidden layers and one output layer), each layer consisting of
elementary processing units (artificial neurons) which incorporate a nonlinear activation
function, commonly the logistic sigmoid function.
The algorithm calculates the difference between the actual response and the desired output of
each neuron of the output layer of the network. Assuming that yj(n) is the actual output of the jth
neuron of the output layer at iteration n and dj(n) is the corresponding desired output, the
error signal ej(n) is defined as:
ej(n) = dj(n) - yj(n)
The instantaneous value of the error for neuron j is defined as ej^2(n)/2 and, correspondingly,
the instantaneous total error E(n) is obtained by summing the neural error ej^2(n)/2 over all
neurons in the output layer. Thus,
E(n) = (1/2) Σj ej^2(n)
In the above formula, j runs over all the neurons of the output layer. If we define N to be the total
number of training patterns that constitute the training set applied to the neural network during the
training process, then the average squared error Eav is obtained by summing E(n) over all the
training patterns and then normalizing with respect to the size N of the training set. Thus,
Eav = (1/N) Σ(n=1..N) E(n)
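These definitions can be checked numerically; the target and output vectors below are made-up values for the illustration:

```python
# Instantaneous total error E(n) and average squared error Eav,
# following the definitions above.
import numpy as np

def instantaneous_error(d, y):
    """E(n) = 1/2 * sum over output neurons j of (d_j(n) - y_j(n))^2."""
    e = np.asarray(d) - np.asarray(y)
    return 0.5 * float(np.sum(e ** 2))

def average_squared_error(targets, outputs):
    """Eav = (1/N) * sum over the N training patterns of E(n)."""
    errors = [instantaneous_error(d, y) for d, y in zip(targets, outputs)]
    return sum(errors) / len(errors)

targets = [[1.0, 0.0], [0.0, 1.0]]   # desired outputs d_j(n), invented
outputs = [[0.8, 0.1], [0.2, 0.7]]   # actual outputs y_j(n), invented
Eav = average_squared_error(targets, outputs)
```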
It is obvious that the instantaneous error E(n), as well as the average squared error Eav, is a
function of all the free parameters of the network. The objective of the learning process is to
modify these free parameters of the network in such a way that Eav is minimized. To perform this
minimization, a simple training algorithm is utilized. The training algorithm updates the synaptic
weights on a pattern-by-pattern basis until one epoch, that is, one complete presentation of the
entire training set is completed. The correction (modification) Δwji(n) that is applied to the
synaptic weight wji (indicating the synaptic strength of the synapse originating from neuron i
and directed to neuron j) after the application of the nth training pattern is proportional to the
partial derivative ∂E(n)/∂wji. Specifically, the correction applied is given by:
Δwji(n) = -η ∂E(n)/∂wji
In the above formula (also known as the delta rule), η is the learning-rate parameter of the
back-propagation algorithm. The use of the minus sign accounts for gradient descent in
weight space, i.e. the search for a direction of weight change that reduces the value of E(n).
The exact value of the learning rate η is of great importance for the convergence of the
algorithm, since it modulates the changes in the synaptic weights from iteration to iteration.
The smaller the value of η, the smoother the trajectory in weight space and the slower the
convergence of the algorithm. On the other hand, if the value of η is too large, the resulting
large changes in the synaptic weights may cause the network to exhibit unstable (oscillatory)
behaviour. Therefore, a momentum term was introduced as a generalization of the above
equation. Thus,
Δwji(n) = α Δwji(n-1) - η ∂E(n)/∂wji
In this equation, α is a positive number called the momentum constant. This update rule is called the
generalized delta rule and includes the delta rule as a special case (α = 0). In terms of the local
gradient δj(n), the weight update can be written as
Δwji(n) = α Δwji(n-1) + η δj(n) yi(n)
In batch mode, the weight adjustment Δwji is made only after the entire training set has been presented to the
network (Konstantinos, A., 2000).
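A minimal sketch of gradient descent with momentum in the spirit of the generalized delta rule, applied to a single linear unit rather than a full MLP; the data, learning rate and momentum constant below are invented for illustration:

```python
# Generalized delta rule on one linear unit:
# Delta_w(n) = alpha * Delta_w(n-1) - eta * dE/dw
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))        # invented training inputs
true_w = np.array([2.0, -1.0])      # invented "unknown" mapping
d = X @ true_w                      # desired outputs

eta, alpha = 0.05, 0.9              # learning rate and momentum constant
w = np.zeros(2)                     # synaptic weights to be learned
dw_prev = np.zeros(2)               # previous correction, for the momentum term

for epoch in range(200):
    y = X @ w                            # actual outputs
    grad = -(d - y) @ X / len(X)         # dE/dw for E = mean of e^2 / 2
    dw = alpha * dw_prev - eta * grad    # generalized delta rule update
    w += dw
    dw_prev = dw
```

With α = 0 the loop reduces to the plain delta rule; the momentum term smooths the trajectory through weight space, as discussed above.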
With respect to the convergence rate the back-propagation algorithm is relatively slow. This is
related to the stochastic nature of the algorithm that provides an instantaneous estimation of the
gradient of the error surface in weight space. In the case that the error surface is fairly flat along
a weight dimension, the derivative of the error surface with respect to that weight is small in
magnitude, therefore the synaptic adjustment applied to the weight is small and consequently
many iterations of the algorithm may be required to produce a significant reduction in the error
performance of the network.
9. Evaluation criteria
The most common error function minimized in neural networks is the sum of squared errors.
Other error functions offered by different software include least absolute deviations, least fourth
powers, asymmetric least squares and percentage differences.
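Three of the error functions named above can be written as one-line formulas; the sample vectors are arbitrary (asymmetric least squares and percentage differences need extra parameters and are omitted from this sketch):

```python
# Error functions commonly offered by neural-network software.
import numpy as np

def sum_squared_error(d, y):
    """Sum of squared errors, the most commonly minimized choice."""
    return float(np.sum((np.asarray(d) - np.asarray(y)) ** 2))

def least_absolute_deviations(d, y):
    """Sum of absolute errors, less sensitive to outliers."""
    return float(np.sum(np.abs(np.asarray(d) - np.asarray(y))))

def least_fourth_powers(d, y):
    """Sum of fourth powers, penalizing large errors more heavily."""
    return float(np.sum((np.asarray(d) - np.asarray(y)) ** 4))

d = np.array([1.0, 2.0, 3.0])   # arbitrary target values
y = np.array([1.5, 1.5, 3.0])   # arbitrary network outputs
```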
10. Conclusions
The ability of ANNs to learn by example makes them very flexible and powerful, and quite
suitable for a variety of problem areas. Hence, to best utilize ANNs for different
problems, it is essential to understand the potential as well as limitations of neural networks. For
some tasks, neural networks will never replace conventional methods, but for a growing list of
applications, the neural architecture will provide either an alternative or a complement to these
existing techniques. ANNs have a huge potential for prediction and classification when they are
integrated with Artificial Intelligence, Fuzzy Logic and related subjects.
The snapshots for opening a project, importing the file from the desired directory and
linking of models (ANNs) to the data file in SAS Miner are given below.
References
Anderson, J. A. (2003). An Introduction to Neural Networks. Prentice Hall.
Chakraborty, S., Ghosh, R., Ghosh, M., Fernandes, C.D. and Charchar, M.J. (2004). Weather-
based prediction of anthracnose severity using artificial neural network models. Plant
Pathology, 53, 375-386.
Cheng, B. and Titterington, D. M. (1994). Neural networks: A review from a statistical
perspective. Statistical Science, 9, 2-54.
Dewolf, E.D. and Francl, L.J. (1997). Neural networks that distinguish infection periods of wheat
tan spot in an outdoor environment. Phytopathology, 87(1), 83-87.
Dewolf, E.D. and Francl, L.J. (2000). Neural network classification of tan spot and Stagonospora
blotch infection periods in a wheat field environment. Phytopathology, 90(2), 108-113.
Gaudart, J., Giusiano, B. and Huiart, L. (2004). Comparison of the performance of multi-layer
perceptron and linear regression for epidemiological data. Computational Statistics &
Data Analysis, 44, 547-570.
Hassoun, M. H. (1995). Fundamentals of Artificial Neural Networks. Cambridge: MIT Press.
Hebb, D.O. (1949). The Organization of Behaviour: A Neuropsychological Theory. Wiley, New
York.
Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective
computational abilities. Proceedings of the National Academy of Sciences (USA),
79, 2554-2558.
Kaastra, I. and Boyd, M. (1996). Designing a neural network for forecasting financial and
economic time series. Neurocomputing, 10(3), 215-236.
Klimasauskas, C.C. (1991). Applying neural networks. Part 3: Training a neural network. PC-AI,
May/June, 20-24.
Konstantinos, A. (2000). Application of Back Propagation Learning Algorithms on Multilayer
Perceptrons. Project Report, Department of Computing, University of Bradford,
England.
Masters, T. (1993). Practical Neural Network Recipes in C++. Academic Press, San Diego.
McCulloch, W.S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Patterson, D. (1996). Artificial Neural Networks. Singapore: Prentice Hall.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65, 386-408.
Rumelhart, D.E., Hinton, G.E. and Williams, R.J. (1986). Learning internal representations by
error propagation. In Parallel Distributed Processing: Explorations in the Microstructure
of Cognition, Vol. 1 (D.E. Rumelhart, J.L. McClelland and the PDP Research Group,
eds.). Cambridge, MA: MIT Press, 318-362.
Sanzogni, L. and Kerr, D. (2001). Milk production estimates using feed forward artificial
neural networks. Computers and Electronics in Agriculture, 32, 21-30.
Schalkoff, R. J. (1997). Artificial Neural Networks. McGraw-Hill.
Warner, B. and Misra, M. (1996). Understanding neural networks as statistical tools. American
Statistician, 50, 284-293.
Widrow, B. and Hoff, M.E. (1960). Adaptive switching circuits. IRE WESCON Convention
Record, 4, 96-104.
Yegnanarayana, B. (1999). Artificial Neural Networks. Prentice Hall.
Zhang, G., Patuwo, B. E. and Hu, M. Y. (1998). Forecasting with artificial neural networks: The
state of the art. International Journal of Forecasting, 14, 35-62.