
Trainable Pattern Classifiers - The Deterministic Approach

Here we study classifiers whose decision functions are generated from training patterns by means of iterative, "learning" algorithms. We know that once a type of decision function has been specified, the problem is reduced to the determination of its coefficients. This section discusses algorithms that are capable of learning the solution coefficients from the training sets whenever the training pattern sets are separable by the specified decision functions.
Frank Rosenblatt (1962) introduced the most primitive type of trainable pattern classifier in the form of an Artificial Neural Network (ANN), known as the Single-Layer Perceptron (SLP).
What is a Neural Network?
A prototype nerve cell, called a neurone, is shown below. Electrical impulses propagating along the axon (axon potentials) activate the synaptic junctions. These, in turn, produce further excitations (post-synaptic potentials) which travel along the dendrites towards the next neurone.
Figure 1: A prototype neurone
The firing rate of each neurone is controlled by the region where the axon joins the cell body, called the hillock zone. When the membrane potential at the hillock zone rises above a certain threshold value, around -60 mV, it causes a travelling wave of charge to propagate. The neurone must restore itself to its proper resting state of balance before sending out the next packet of charge; this recovery time is called the refractory period.
Information is thus passed via synapses. A synapse is termed excitatory or inhibitory depending on whether its post-synaptic potentials increase or reduce the hillock potential, respectively enhancing or reducing the likelihood of triggering an impulse there.
The current level of understanding of brain function is still so limited that not even one area of the brain is yet fully understood. Thus an artificial neural network only tries to mimic the biological neural network in a very crude and primitive manner.
What is an ANN?
An artificial neural network is a parallel, distributed information processing structure in the form of a directed graph, with the following sub-definitions and restrictions:
1. The nodes of the graph are called processing elements.
2. The links of the graph are called connections. Each connection functions as an instantaneous, unidirectional signal-conduction path.
3. Each processing element can receive any number of incoming connections (also called input connections).
4. Each processing element can have any number of outgoing connections, but the signals in all of these must be the same.
5. Processing elements can have local memory.
6. Each processing element possesses a transfer function, which can use (and alter) local memory, can use input signals, and which produces the processing element's output signal. Transfer functions can operate continuously or episodically. If they operate episodically, there must be an input called "activate" that causes the processing element's transfer function to operate on the current input signals and local memory values and to produce an updated output signal (and possibly to modify local memory values). Continuous processing elements are always operating. The "activate" input arrives via a connection from a scheduling processing element that is part of the network.
7. Input signals to a neural network from outside the network arrive via connections that originate in the outside world. Outputs from the network to the outside world are connections that leave the network.
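The restrictions above map naturally onto a small data structure. The sketch below is purely illustrative; the class names, the fields, and the simple weighted-sum transfer function are assumptions of this example, not part of the definition itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Connection:
    """Restriction 2: an instantaneous, unidirectional signal-conduction path."""
    source: "ProcessingElement"
    weight: float = 1.0

@dataclass
class ProcessingElement:
    """Restrictions 1 and 3-6: a node with any number of incoming connections,
    one output signal shared by all outgoing links, local memory, and a
    transfer function invoked by an 'activate' signal."""
    incoming: List[Connection] = field(default_factory=list)
    local_memory: dict = field(default_factory=dict)
    transfer: Callable[[float, dict], float] = lambda net, mem: net
    output: float = 0.0

    def activate(self) -> float:
        # Episodic operation: combine the current input signals and the local
        # memory through the transfer function and update the output signal.
        net = sum(c.weight * c.source.output for c in self.incoming)
        self.output = self.transfer(net, self.local_memory)
        return self.output
```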
A Neural Processing Element
Figure 2: A neural processing element
The input signals $x_1, x_2, \ldots, x_n$ arriving at the processing element are supplied to the transfer function, as is the "activate" input. The transfer function of an episodically updated processing element, when activated, uses the current values of the input signals, as well as values in local memory, to produce the processing element's new output signal value $y$.
Another typical model of a neurone
Figure 3: Another typical model of a neurone
Note that the strength of each synaptic junction is represented by a multiplicative factor, or weight, with a positive sign for excitatory connections and a negative sign otherwise. The hillock zone is modelled by a summation of the signals received over every link. The firing rate of the neurone in response to this aggregate signal is then described by a mathematical function, whose value represents the frequency of emission of electrical impulses along the axon.
The general artificial neuron model has five components, shown in the following list. (The subscript $i$ indicates the $i$-th input or weight.)
1. A set of inputs, $x_i$.
2. A set of weights, $w_i$.
3. A bias.
4. An activation function, $f$.
5. The neuron output, $y$.
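Putting the five components together, the neuron computes $y = f\bigl(\sum_i w_i x_i + b\bigr)$. The following minimal sketch only illustrates this sum-and-squash structure; the choice of tanh as the activation function and the sample numbers are illustrative assumptions.

```python
import numpy as np

def neuron_output(x, w, bias, f=np.tanh):
    """General artificial neuron: y = f(sum_i w_i * x_i + bias)."""
    return f(np.dot(w, x) + bias)

# Two excitatory synapses (positive weights) and one inhibitory synapse
# (negative weight), as in the biological model described above.
y = neuron_output(x=[1.0, 0.5, 0.8], w=[0.7, 0.4, -0.6], bias=-0.2)
```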
Table 1: A comparison of neural networks and conventional computers.

Neural networks             Conventional computers
Many simple processors      Few complex processors
Few processing steps        Many computational steps
Distributed processing      Symbolic processing
Trained by example          Explicit programming
Algorithm                   Type              Function
Hopfield                    recursive         optimization
Multi-layered perceptron    feedforward       classification
Kohonen                     self-organizing   data coding
Temporal differences        predictive        forecasting
Figure 4: A taxonomy of six neural nets that can be used as classifiers.
[The figure divides neural net classifiers by input type (binary input: Hopfield net, Hamming net, Carpenter/Grossberg classifier; continuous-valued input: perceptron, multi-layer perceptron, Kohonen self-organizing feature map) and by training style (supervised or unsupervised), with related classical methods shown alongside: the Gaussian classifier, the optimum classifier, the k-nearest-neighbour mixture, and the leader and k-means clustering algorithms.]
Single Layer Perceptron
The single-layer perceptron consists of only one node and can be used with both continuous-valued and binary inputs. A perceptron that decides whether an input belongs to one of two classes (denoted A and B) is shown below. The single node computes a weighted sum of the input elements, subtracts a threshold $\theta$, and passes the result through a hard-limiting nonlinearity, so that the output $y$ is either $+1$ or $-1$, for class A and class B respectively.
The perceptron forms two decision regions separated by a hyperplane, which in 2-D is a line. As can be seen, the equation of the boundary line depends on the connection weights and the threshold.
Connection weights and the threshold in a perceptron can be fixed or adapted using a number of different algorithms. The original perceptron convergence procedure for adjusting the weights was developed by Rosenblatt and is described below.
First, the connection weights and the threshold value are initialized to small random non-zero values. Then a new input with $N$ continuous-valued elements is applied to the input and the output is computed as in Step 3. Connection weights are adapted only when an error occurs, using the formula in Step 4. This formula includes a gain term that ranges from 0.0 to 1.0 and controls the adaptation rate.
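The single-node computation just described (weighted sum, subtraction of the threshold, hard limiter) can be sketched as follows; the function name and the example values are illustrative assumptions, not part of the original procedure.

```python
import numpy as np

def slp_output(x, w, theta):
    """Single-layer perceptron output: +1 for class A, -1 for class B."""
    return 1 if np.dot(w, x) - theta > 0 else -1

# In 2-D the decision boundary w[0]*x1 + w[1]*x2 = theta is a straight line.
print(slp_output(x=[0.5, 1.2], w=[1.0, -0.8], theta=0.1))
```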
The Perceptron Convergence Procedure
Step 1: Initialize Weights and Threshold
Set $w_i(0)$, $0 \le i \le N-1$, and $\theta$ to small random values. Here $w_i(t)$ is the weight from input $i$ at time $t$, and $\theta$ is the threshold in the output node.
Step 2: Present New Input and Desired Output
Present the new continuous-valued input $x_0, x_1, \ldots, x_{N-1}$ along with the desired output $d(t)$.
Step 3: Calculate Actual Output
$$y(t) = f\left(\sum_{i=0}^{N-1} w_i(t)\,x_i(t) \;-\; \theta\right)$$
Step 4: Adapt Weights
$$w_i(t+1) = w_i(t) + \eta\,\bigl[d(t) - y(t)\bigr]\,x_i(t), \qquad 0 \le i \le N-1,$$
where
$$d(t) = \begin{cases} +1 & \text{if the input is from class A,}\\ -1 & \text{if the input is from class B.} \end{cases}$$
In these equations $\eta$ is a positive gain fraction less than 1 and $d(t)$ is the desired correct output for the current input. Note that the weights are unchanged if the correct decision is made by the net.
Step 5: Repeat by Going to Step 2
Note that the gain term $\eta$ must be adjusted to satisfy the conflicting requirements of fast adaptation for real changes in the input distributions and averaging of past inputs to provide stable weight estimates.
Rosenblatt proved that if the inputs presented from the two classes are separable (that is, they fall on opposite sides of some hyperplane), then the perceptron convergence procedure converges and positions the decision hyperplane between those two classes.
One problem with the perceptron convergence procedure is that the decision boundaries may oscillate continuously when the inputs are not separable and the distributions overlap.
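For concreteness, the following sketch transcribes Steps 1-5 into code. It is an illustration rather than a definitive implementation: the function name and defaults are assumed here, and the threshold is adapted alongside the weights by treating it as a weight on a constant input of -1, a common convention that the steps above leave implicit.

```python
import numpy as np

def hard_limit(v):
    return 1 if v > 0 else -1

def perceptron_convergence(X, d, eta=0.1, max_epochs=100, seed=0):
    """Rosenblatt's perceptron convergence procedure (Steps 1-5).

    X   : (patterns, N) array of continuous-valued inputs
    d   : desired outputs, +1 for class A and -1 for class B
    eta : gain term, 0 < eta < 1
    """
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.5, 0.5, X.shape[1])        # Step 1: small random weights
    theta = rng.uniform(-0.5, 0.5)                #         and threshold
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, d):               # Step 2: present input and desired output
            y = hard_limit(np.dot(w, x) - theta)  # Step 3: actual output
            if y != target:                       # Step 4: adapt only when an error occurs
                w = w + eta * (target - y) * x
                theta = theta - eta * (target - y)
                errors += 1
        if errors == 0:                           # Step 5: repeat until an error-free pass
            break
    return w, theta
```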
MULTI-LAYER PERCEPTRON
Multi-layer perceptrons are feed-forward nets with one or more layers of nodes (called hidden layers) between the input and output nodes. A three-layer perceptron with two layers of hidden units is shown below. Multi-layer perceptrons overcome many of the limitations of the single-layer perceptron and have been shown to be successful for many problems of interest.
Figure 5: An MLP with one hidden layer
The capabilities of perceptrons with one, two, and three layers that use hard-limiting nonlinearities are illustrated in Figure 6. The second column of the figure indicates the types of decision regions that can be formed with the different nets. The next two columns present examples of decision regions that can be formed for the exclusive-OR problem and for a problem with meshed regions. The rightmost column gives examples of the most general decision regions that can be formed.
Figure 6: Pattern classification using MLP
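Before returning to the single-layer case, the following toy sketch (with hand-picked weights, not weights produced by any training procedure discussed here) shows how a two-layer perceptron built from hard-limiting nonlinearities forms the exclusive-OR decision region that a single-layer perceptron cannot:

```python
def step(v):
    """Hard-limiting nonlinearity: 1 if v > 0, else 0."""
    return 1 if v > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1: fires for x1 OR x2
    h2 = step(x1 + x2 - 1.5)        # hidden unit 2: fires for x1 AND x2
    return step(h1 - 2 * h2 - 0.5)  # output: h1 AND NOT h2, i.e. exclusive OR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))  # prints the XOR truth table
```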
The decision boundary provided by the SLP is given by
$$y(t) = f\left(\sum_{i=0}^{N-1} w_i(t)\,x_i(t) - \theta\right).$$
Setting $\theta = 0$, we get
$$y(t)\;\begin{cases} > 0 & \text{if the input is from class } \omega_1,\\ < 0 & \text{if the input is from class } \omega_2. \end{cases}$$
In other words, we want to find a solution weight vector $w$ with the property that
$$w'x > 0 \text{ for all patterns of } \omega_1 \quad\text{and}\quad w'x < 0 \text{ for all patterns of } \omega_2. \qquad (1)$$
If the patterns of $\omega_2$ are multiplied by $-1$, we obtain the equivalent condition $w'x > 0$ for all patterns.
Letting $N$ represent the total number of augmented sample patterns in both classes, we may express the problem as one of finding a vector $w$ such that the system of inequalities
$$Xw > 0 \qquad (2)$$
is satisfied, where
$$X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}, \qquad w = (w_1, w_2, \ldots, w_n, w_{n+1})',$$
and $0$ is the zero vector.
If there exists a $w$ which satisfies expression (2), the inequalities are said to be consistent; otherwise, they are inconsistent.
Following the condition given in (1), the perceptron algorithm is given by:

If $x(t) \in \omega_1$ and $w'(t)\,x(t) > 0$, let
$$w(t+1) = w(t);$$
otherwise, replace $w(t)$ by
$$w(t+1) = w(t) + c\,x(t),$$
where $c$ is the correction factor.

If $x(t) \in \omega_2$ and $w'(t)\,x(t) < 0$, let
$$w(t+1) = w(t);$$
otherwise,
replace $w(t)$ by
$$w(t+1) = w(t) - c\,x(t).$$
Here the amount of weight correction is not proportional to the amount of error, but is a constant fraction of the input pattern being misclassified.
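A compact sketch of this fixed-increment rule on augmented patterns is given below; the function name, the zero initial weight vector, and the pass limit are choices made here for illustration only.

```python
import numpy as np

def fixed_increment_perceptron(class1, class2, c=1.0, max_passes=100):
    """Perceptron algorithm on augmented patterns, as stated above.

    A correction of +/- c*x is made only when a pattern is misclassified:
    w'x <= 0 for an omega_1 pattern, or w'x >= 0 for an omega_2 pattern.
    """
    patterns = [(np.asarray(x, float), +1) for x in class1] + \
               [(np.asarray(x, float), -1) for x in class2]
    w = np.zeros(len(patterns[0][0]))
    for _ in range(max_passes):
        errors = 0
        for x, cls in patterns:
            s = np.dot(w, x)
            if cls == +1 and s <= 0:
                w = w + c * x
                errors += 1
            elif cls == -1 and s >= 0:
                w = w - c * x
                errors += 1
        if errors == 0:       # one complete error-free pass means a solution was found
            return w
    return w                  # the classes may not be separable
```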
Problem: Apply the perceptron algorithm to the following augmented patterns to find a solution weight vector for a two-class problem.
The classes are
$$\omega_1: \{(0,\,0,\,1)',\ (0,\,1,\,1)'\} \quad\text{and}\quad \omega_2: \{(1,\,0,\,1)',\ (1,\,1,\,1)'\}.$$
Setting $c = 1$ and $w(1) = 0$, and presenting the patterns in the above order, results in the following sequence of steps:

$$w'(1)\,x(1) = (0,\,0,\,0)\begin{pmatrix}0\\0\\1\end{pmatrix} = 0, \qquad w(2) = w(1) + x(1) = \begin{pmatrix}0\\0\\1\end{pmatrix}$$

$$w'(2)\,x(2) = (0,\,0,\,1)\begin{pmatrix}0\\1\\1\end{pmatrix} = 1, \qquad w(3) = w(2) = \begin{pmatrix}0\\0\\1\end{pmatrix}$$

$$w'(3)\,x(3) = (0,\,0,\,1)\begin{pmatrix}1\\0\\1\end{pmatrix} = 1, \qquad w(4) = w(3) - x(3) = \begin{pmatrix}-1\\0\\0\end{pmatrix}$$

$$w'(4)\,x(4) = (-1,\,0,\,0)\begin{pmatrix}1\\1\\1\end{pmatrix} = -1, \qquad w(5) = w(4) = \begin{pmatrix}-1\\0\\0\end{pmatrix}$$

where corrections on the weight vector were made in the first and third steps because of misclassifications. Since a solution is obtained only when the algorithm yields a complete, error-free iteration through all the patterns, the training set must be presented again.
The machine learning process is continued by letting $x(5) = x(1)$, $x(6) = x(2)$, $x(7) = x(3)$, and $x(8) = x(4)$. The second iteration through the patterns yields:

$$w'(5)\,x(5) = 0, \qquad w(6) = w(5) + x(5) = \begin{pmatrix}-1\\0\\1\end{pmatrix}$$

$$w'(6)\,x(6) = 1, \qquad w(7) = w(6) = \begin{pmatrix}-1\\0\\1\end{pmatrix}$$

$$w'(7)\,x(7) = 0, \qquad w(8) = w(7) - x(7) = \begin{pmatrix}-2\\0\\0\end{pmatrix}$$

$$w'(8)\,x(8) = -2, \qquad w(9) = w(8) = \begin{pmatrix}-2\\0\\0\end{pmatrix}$$

Since two errors occurred in this iteration, the patterns are presented again:

$$w'(9)\,x(9) = 0, \qquad w(10) = w(9) + x(9) = \begin{pmatrix}-2\\0\\1\end{pmatrix}$$

$$w'(10)\,x(10) = 1, \qquad w(11) = w(10) = \begin{pmatrix}-2\\0\\1\end{pmatrix}$$

$$w'(11)\,x(11) = -1, \qquad w(12) = w(11) = \begin{pmatrix}-2\\0\\1\end{pmatrix}$$

$$w'(12)\,x(12) = -1, \qquad w(13) = w(12) = \begin{pmatrix}-2\\0\\1\end{pmatrix}$$
It is easily verified that in the next iteration all the patterns are classified correctly. The solution weight vector is, therefore, $w = (-2,\,0,\,1)'$. The corresponding decision function is $d(x) = -2x_1 + 1$, which, when set equal to zero, becomes the equation of the decision boundary shown in the figure below.
Figure 7: (a) Patterns belonging to two classes. (b) Decision boundary determined by training.
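As a quick illustrative check, the solution vector found above indeed satisfies condition (1): $w'x > 0$ for both patterns of $\omega_1$ and $w'x < 0$ for both patterns of $\omega_2$.

```python
import numpy as np

w = np.array([-2, 0, 1])                          # solution vector from the example
omega1 = [np.array([0, 0, 1]), np.array([0, 1, 1])]
omega2 = [np.array([1, 0, 1]), np.array([1, 1, 1])]

assert all(np.dot(w, x) > 0 for x in omega1)      # w'x = 1 for both class 1 patterns
assert all(np.dot(w, x) < 0 for x in omega2)      # w'x = -1 for both class 2 patterns
```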
According to Eq. (2), we may express the perceptron algorithm in an equivalent form by multiplying the augmented patterns of one class by $-1$. Thus, arbitrarily multiplying the patterns of $\omega_2$ by $-1$, we can write the perceptron algorithm as
$$w(t+1) = \begin{cases} w(t) & \text{if } w'(t)\,x(t) > 0,\\ w(t) + c\,x(t) & \text{if } w'(t)\,x(t) \le 0, \end{cases}$$
where $c$ is a positive correction increment.
The Gradient Technique
The gradient technique provides a tool for finding the minimum of a function. We know from vector analysis that the gradient of a function $f(y)$ with respect to the vector $y = (y_1, y_2, \ldots, y_n)'$ is defined as
$$\operatorname{grad} f(y) \;=\; \frac{d f(y)}{d y} \;=\; \left(\frac{\partial f}{\partial y_1},\; \frac{\partial f}{\partial y_2},\; \ldots,\; \frac{\partial f}{\partial y_n}\right)'.$$
We see from this equation that the gradient of a scalar function of a vector argument is a vector, and that each component of the gradient gives the rate of change of the function in the direction of that component.
One of the most important properties of the gradient vector is that it points in the direction of the maximum rate of increase of the function $f$ as the argument increases. Conversely, the negative of the gradient points in the direction of the maximum rate of decrease of $f$.
On the basis of this property, we can devise iterative schemes for finding the minimum of a function. Only functions with a unique minimum will be considered.
The approach we shall take to finding a solution to the set of linear inequalities $w'x_i > 0$ will be to define a criterion function $J(w)$ that is minimized if $w$ is a solution vector. This reduces our problem to one of minimizing a scalar function $J(w)$, a problem that can often be solved by a gradient descent procedure.
Here we start with some arbitrarily chosen weight vector $w(1)$ and compute the gradient vector $\nabla J(w(1))$. The next value $w(2)$ is obtained by moving some distance from $w(1)$ in the direction of steepest descent, i.e., along the negative of the gradient. In general, $w(k+1)$ is obtained from $w(k)$ by the equation
$$w(k+1) = w(k) - \eta(k)\,\nabla J(w(k)),$$
where $\eta(k)$ is a positive scale factor, or learning rate, that sets the step size. Such a sequence of weight vectors will converge to a solution minimizing $J(w)$. In algorithmic form we have:
Algorithm (Basic Gradient Descent)
1  begin initialize $w$, threshold $\theta$, $\eta(\cdot)$, $k \leftarrow 0$
2      do $k \leftarrow k + 1$
3          $w \leftarrow w - \eta(k)\,\nabla J(w)$
4      until $\left|\eta(k)\,\nabla J(w)\right| < \theta$
5  return $w$
6  end
If $\eta(k)$ is too small, convergence is needlessly slow, whereas if $\eta(k)$ is too large, the correction process will overshoot and can even diverge.
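The algorithm can be transcribed directly; in the sketch below the quadratic criterion used as a stand-in for $J$, the constant learning rate, and the function names are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.1, tol=1e-6, max_iter=1000):
    """Basic gradient descent: repeat w <- w - eta * grad_J(w)
    until the correction eta * grad_J(w) falls below the threshold tol."""
    w = np.asarray(w0, float)
    for _ in range(max_iter):
        step = eta * grad_J(w)
        w = w - step                      # move along the negative gradient
        if np.linalg.norm(step) < tol:
            break
    return w

# Example with a unique minimum: J(w) = ||w - a||^2, so grad J(w) = 2*(w - a).
a = np.array([3.0, -1.0])
w_star = gradient_descent(lambda w: 2 * (w - a), w0=[0.0, 0.0])   # converges to a
```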
Another function which achieves its minimum value whenever $w'x > 0$ is given by
$$J(w, x) = |w'x| - w'x, \qquad (3)$$
where $|w'x|$ is the absolute value of $w'x$. It is evident that the minimum of this function is $J(w, x) = 0$ and that this minimum results when $w'x > 0$. We are excluding, of course, the trivial case in which $w = 0$.
The approach employed below consists of incrementing $w$ in the direction of the negative gradient of $J(w, x)$ in order to seek the minimum of the function. In other words, if we let $w(k)$ represent the value of $w$ at the $k$-th step, the general gradient descent algorithm may be written as
$$w(k+1) = w(k) - c\left[\frac{\partial J(w, x)}{\partial w}\right]_{w = w(k)}, \qquad (4)$$
where $w(k+1)$ represents the new value of $w$, and $c > 0$ dictates the magnitude of the correction. It is noted that no corrections are made on $w$ when $\partial J/\partial w = 0$, which is the condition for a minimum.
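Carrying out the differentiation in Eq. (4) makes the correction explicit. For $w'x \neq 0$,
$$\frac{\partial J(w,x)}{\partial w} = \frac{\partial}{\partial w}\bigl(|w'x| - w'x\bigr) = \bigl(\operatorname{sgn}(w'x) - 1\bigr)\,x = \begin{cases} 0 & \text{if } w'x > 0,\\ -2x & \text{if } w'x < 0, \end{cases}$$
so Eq. (4) becomes
$$w(k+1) = \begin{cases} w(k) & \text{if } w'(k)\,x > 0,\\ w(k) + 2c\,x & \text{if } w'(k)\,x < 0, \end{cases}$$
which is exactly the fixed-increment perceptron correction (with increment $2c$); no correction is made on patterns already satisfying $w'x > 0$.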
Equation (4) may be interpreted geometrically with the aid of the figure below.
Figure 5.3: Geometrical illustration of the gradient descent algorithm
We see that, if $\partial J/\partial w$ is negative at the $k$-th step, $w$ is incremented in the direction of the minimum of $J$. It is evident from the figure that this descent scheme will eventually lead to a positive $w$ and, consequently, to the minimum value of $J$.
It should also be noted that the above figure is a plot of Eq. (3) for the one-dimensional case. Clearly, there are as many curves as there are patterns in a problem.
If the inequalities are consistent and a proper $J(w, x)$ is chosen, the algorithm of Eq. (4) will result in a solution. Otherwise, it will simply oscillate until the procedure is stopped.