Single Layer Perceptron for Regression and Classification
Neural Networks Assignment 1, Candidate Number: 19214
The document describes the training process of a neural network for classification. It discusses:
1. The parameters used for training, including a learning rate of 0.05, a bias of 1.0, and Sequential Gradient Descent.
2. Why shuffling the data and Sequential Gradient Descent were chosen over Batch Gradient Descent.
3. Different methods of initialising the weights, and why initialising them to place the decision boundary at x = 2 worked best on average.
Part A (1): Illustration of Training Process

Setup of training process

Parameter            Setting
η                    0.5
bias                 1.0
Error function       Perceptron criterion
Weight update rule   Sequential Gradient Descent
Activation function  Heaviside: y = +1 if (x > 0), -1 otherwise
Input patterns       class 1 = (1, 1); (0, 1), target = +1
                     class 2 = (0, 0); (1, 0), target = -1
Initial weights      w^T = (w0, w1, w2) = (0, 0, 0)

Table 1: Setup of Training Process

I am using the Perceptron criterion as error function because of the information it carries about the current error gradient, and I am using Sequential Gradient Descent because in a quick experimental evaluation I found that it converges faster than Batch Gradient Descent (see the discussion in Part A (2) below).

After initialisation, the weight vector is at the origin; hence, with the given input, 3 patterns, (1, 1), (0, 1) and (1, 0), are misclassified in the first training epoch. The input (0, 0) is classified correctly: its activation value is 0, and as the activation function has a hard limit at x > 0, the pattern is classified with a predicted target value of -1.

Plot 1: State of the Network after Initialisation

After the first epoch the weight vector is updated to w^T = (-0.5, 0, 0.5), giving a decision boundary of x2 = -w0/w2, a horizontal line at x2 = 1. In the second training epoch, 2 input patterns, (1, 1) and (1, 0), are misclassified, and the weight vector is updated to w^T = (-0.5, 0, 1). This shifts the decision boundary down towards the origin, to a horizontal line at x2 = 0.5. With this weight update, all points are correctly classified, and training therefore ends after the next epoch, in which the weights are left unchanged, as the correct decision boundary has already been learnt. The final weight vector is therefore w^T = (-0.5, 0, 1) and is displayed in Plot 3 below.

Plot 2: State of the Network after Training Epoch 1
Plot 3: State of the Network after Training Epoch 2
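The update rule just described is compact enough to state in full. The following MATLAB sketch (illustrative variable names, not the code from the appendix) trains on the four patterns above with the Perceptron criterion and Sequential Gradient Descent; with this presentation order it passes through the same per-epoch weight vectors, (-0.5, 0, 0.5) and then (-0.5, 0, 1).

    % Sketch of the Part A (1) run: Perceptron criterion with
    % Sequential Gradient Descent. Illustrative names only.
    eta  = 0.5;                         % learning rate
    bias = 1.0;                         % constant bias input
    X = [1 1; 0 1; 0 0; 1 0];           % input patterns, one per row
    t = [1; 1; -1; -1];                 % targets: class 1 = +1, class 2 = -1
    w = [0; 0; 0];                      % initial weights (w0, w1, w2)
    for epoch = 1:100
        errors = 0;
        for n = 1:size(X, 1)
            x = [bias; X(n, :)'];       % augment pattern with the bias input
            y = 2 * ((w' * x) > 0) - 1; % Heaviside: +1 if activation > 0, else -1
            if y ~= t(n)                % Perceptron criterion: only errors
                w = w + eta * t(n) * x; % contribute, and the update is applied
                errors = errors + 1;    % immediately (sequential, not batch)
            end
        end
        if errors == 0, break; end      % an epoch without updates: converged
    end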
Learnability

The Perceptron is able to learn any linearly separable input. Out of the 6 different input combinations, 4 are linearly separable and can be learnt by the Perceptron; the other 2, representing XOR and XNOR respectively, are not linearly separable and hence cannot be learnt by the Perceptron. Table 2 below gives an overview of which input combinations the Perceptron can learn.

NB: In essence, the 6 different input combinations conform to 3 unique input patterns; that is, when class 1 = (0, 0); (1, 1) and class 2 = (0, 1); (1, 0), there is another input combination with class 1 and class 2 swapped, which represents the same pattern, just with different class adherence. Hence, the Perceptron is able to learn 2 out of 3 unique input patterns.

#  Input patterns              Target  Learnable  Comment
1  class 1 = (0, 0); (0, 1)    +1      Yes
   class 2 = (1, 0); (1, 1)    -1
2  class 1 = (0, 0); (1, 0)    +1      Yes
   class 2 = (0, 1); (1, 1)    -1
3  class 1 = (0, 0); (1, 1)    +1      No         XNOR
   class 2 = (0, 1); (1, 0)    -1
4  class 1 = (0, 1); (1, 0)    +1      No         XOR
   class 2 = (0, 0); (1, 1)    -1
5  class 1 = (0, 1); (1, 1)    +1      Yes
   class 2 = (0, 0); (1, 0)    -1
6  class 1 = (1, 0); (1, 1)    +1      Yes
   class 2 = (0, 0); (0, 1)    -1

Table 2: Learnability Overview

Epochs until Convergence

For the illustrated training procedure with η = 0.5, bias = 1.0 and an initial weight vector of w^T = (0, 0, 0), the learning algorithm converged after 2-3 epochs (for the linearly separable problems). Table 3 gives a short overview of the number of epochs until convergence per input pattern for the above-mentioned starting parameters.

Input patterns                           Epochs until convergence
class 1 = (1, 0); (1, 1), target = +1    2
class 2 = (0, 0); (0, 1), target = -1
class 1 = (0, 0); (0, 1), target = +1    3
class 2 = (1, 0); (1, 1), target = -1
class 1 = (0, 0); (0, 1), target = +1    3
class 2 = (1, 0); (1, 1), target = -1
class 1 = (1, 1); (0, 1), target = +1    3
class 2 = (0, 0); (1, 0), target = -1

Table 3: Epochs until Convergence Overview

In general I found that convergence itself, that is, whether or not the problem is learnable, is independent of the values of η and the bias, and further, that the value of η doesn't affect the number of epochs taken until convergence; the bias value, however, does have an impact on the number of epochs for the given setup. Plot 4 displays the number of epochs until convergence for varying values of η = {1.5, 1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001}, bias = {-2, -1.5, -1.0, -0.5, 0, 0.5, 1, 1.5, 2} and an initial weight vector of w^T = (0, 0, 0).

Plot 4: Error surface for different values of η and the bias

Part A (2): Training Process

Basic setup of training process

Parameter            Setting
η                    0.05
bias                 1.0
Error function       Perceptron criterion
Weight update rule   Sequential Gradient Descent
Activation function  Heaviside: y = +1 if (x > 0), -1 otherwise

Table 4: Basic initialisation of the free parameters

I decided to use the Perceptron criterion as error function because it gives me information about the error gradient, which is needed to perform Gradient Descent. I further chose Sequential Gradient Descent in favour of Batch Gradient Descent because I found that Sequential Gradient Descent converged quicker in my experiments: over 100 test runs, Sequential Gradient Descent converged after 17 epochs on average, whereas Batch Gradient Descent took 21 epochs on average. I further decided to stick with a bias value of 1.0 and to choose η = 0.05, which, after a few test runs, appeared to be a reasonable compromise between granularity and speed of convergence.

To Shuffle or not to Shuffle

Before every training epoch, I shuffled the whole dataset so as to not overfit the data. In general I found that when shuffling is performed, convergence usually takes longer (over 100 test runs, a network that shuffles its data before every epoch converged in 17 epochs on average, whereas a network without shuffling converged in 13 epochs on average), but often results in more solid-looking decision boundaries. Hence, I added shuffling of the input data to my training regime.
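As a sketch of the distinction (illustrative names again, assuming X, t, w, eta and bias as above): Sequential Gradient Descent applies the correction inside the pattern loop, so later patterns in the same epoch already see the new weights, whereas Batch Gradient Descent accumulates the corrections and applies them once per epoch; the per-epoch shuffle is a single randperm.

    % Sketch: one epoch under each update regime, with shuffling.
    order = randperm(size(X, 1));          % shuffle before every epoch
    Xs = X(order, :);  ts = t(order);

    w_seq = w;                             % sequential: update per pattern
    for n = 1:size(Xs, 1)
        x = [bias; Xs(n, :)'];
        if (2 * ((w_seq' * x) > 0) - 1) ~= ts(n)
            w_seq = w_seq + eta * ts(n) * x;   % takes effect immediately
        end
    end

    w_bat = w;  dw = zeros(size(w));       % batch: accumulate, apply once
    for n = 1:size(Xs, 1)
        x = [bias; Xs(n, :)'];
        if (2 * ((w_bat' * x) > 0) - 1) ~= ts(n)
            dw = dw + eta * ts(n) * x;     % weights frozen during the epoch
        end
    end
    w_bat = w_bat + dw;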
Weight Initialisation

I found that initialising the weights in specific ways has a significant impact on the number of training epochs the network takes to converge. For the given task, we were to sample 10 data points from 2 Gaussian distributions with means μ1 = 0 and μ2 = 4 respectively. Plot 5 shows the 2 distributions with the optimal decision boundary (in a Bayesian sense) at x = 2. Hence, I expected that an initial decision boundary at x = 2 would often already be the correct decision boundary for the given data points.

Plot 5: The 2 Gaussian distributions

This hypothesis turned out to be true: in an experiment with 1000 test runs, initialising the weights so that the decision boundary is a vertical line at x = 2 turned out to be the correct solution in 614 out of 1000 runs. The mean number of epochs until convergence was 15; however, this number is somewhat distorted by the fact that not all of the 1000 problems were linearly separable, in which case the algorithm terminated after 200 iterations. In comparison, random weight initialisation took at least 2 epochs to find a decision boundary, and this happened only 184 out of 1000 times, thus underlining the superiority of my weight initialisation method.

However, I found an even better weight initialisation method than setting a vertical line at x = 2: initialising the weights using the Minimum Squared Error criterion, which essentially is Gradient Descent with Least Mean Squares as error function. As the gradient is available in closed form for this setup, no Gradient Descent was needed and the Least Mean Squares solution could be obtained directly. The caveat of this method is that it may fail to find a solution even if there is one; hence I only used it as a way of initialising the weights for the network. This initialisation turned out to be a solution in 692 out of 1000 test runs, outperforming the vertical-line initialisation at x = 2. The average number of epochs until convergence for this method was ~14.6; however, as already mentioned above, this number is slightly distorted by the fact that not all problems were linearly separable. Taking the whole setup further, the Minimum Squared Error criterion would have given rise to the Ho-Kashyap procedure, which I started to implement, but lack of time prevented me from finishing an implementation [1]. Note that, for the purpose of better illustrating the network's learning progress, I generally initialised the weights with random values.

[1] Gutierrez-Osuna, Ricardo: L13: Linear Discriminant Functions. Texas: Texas A&M University. Available from: http://research.cs.tamu.edu/prism/lectures/pr/pr_l17.pdf (accessed 13th February 2014)

Convergence η Decay

I also experimented with a decay factor for the learning rate η, dividing the learning rate by 2 every 20 epochs. This approach introduces an additional advantage as well as an additional disadvantage. The merit is that if the learning rate was initially too large, which could lead to a global minimum being overshot, the decay of the learning rate acts as a regulator, scaling η down until a minimum can be reached. The drawback is that it can slow down convergence and cause the algorithm to terminate without having found a solution even if there was one. As I didn't want that to happen, I disabled the decay factor for most experiments.
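Both initialisation schemes, and the decay schedule, are simple to state. A sketch under the assumptions that the bias input is 1.0 and the 2-D patterns sit in the rows of X with targets t = ±1 (illustrative names, not the appendix code):

    % Initialisation 1: decision boundary as a vertical line at x1 = 2.
    % With a bias input of 1.0, w0 + w1*x1 + w2*x2 = 0 reduces to x1 = 2
    % for the weight vector below.
    w_line = [-2; 1; 0];

    % Initialisation 2: Minimum Squared Error weights in closed form.
    Xa = [ones(size(X, 1), 1), X];   % design matrix with the bias column
    w_mse = pinv(Xa) * t;            % least-squares solution; may fail to
                                     % separate even separable data

    % Learning-rate decay as described above: halve eta every 20 epochs.
    eta = eta0 / 2^floor(epoch / 20);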
Sigmoidal Activation vs. Heaviside Activation

The major difference between the Heaviside function and sigmoids (apart from the fact that the latter can be differentiated whereas the former cannot) is that a sigmoidal activation function such as tanh is continuous, whereas the Heaviside activation function is undefined at an activation value of 0. The value returned by a sigmoidal activation function can be interpreted as the level of confidence in the current classification decision, and indeed the logistic activation function represents the probability of the given class for a given data point, (C|x). In other words, the value returned by a sigmoidal activation function can also be seen as a distance measure to the current decision boundary: for the tanh activation function, the closer the value is to 0, the closer the current data point is to the decision boundary. A sigmoidal and the Heaviside activation function share the fact that at some point a hard limit needs to be applied in order to get a classification decision. For the Heaviside function as well as for tanh this happens at an activation value of 0, where the data point needs to be mapped to the target space in some way.

I empirically evaluated the average number of epochs until convergence for a tanh activation function and the Heaviside activation function, and found that on average a network trained with the Heaviside activation function converges slightly faster than with tanh. Over 100 test runs, a network with tanh converged after 14 epochs on average, whereas a network trained with Heaviside converged after 12.7 epochs on average. But this is not the only difference: the resulting decision boundaries usually differ as well, as Plots 6 & 7 show, where classification was carried out with the same data points and the same initialisation of the network's free parameters. However, independent of the activation function, only linearly separable problems can be solved.

Plot 6: Resulting decision boundary with the Heaviside activation function
Plot 7: Resulting decision boundary with the tanh activation function
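In code the two activations differ only in what is returned before the hard limit; the decision itself is taken at 0 in both cases. A sketch with illustrative names, for a single augmented pattern x:

    a = w' * x;                      % activation for one augmented pattern

    y_hv  = 2 * (a > 0) - 1;         % Heaviside: hard decision only

    y_th  = tanh(a);                 % tanh: graded value in (-1, 1); |y_th|
                                     % shrinks near the decision boundary,
                                     % so it can be read as confidence
    label = 2 * (y_th > 0) - 1;      % hard limit still needed for the class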
Non-linearly separable Data

If no exit criterion after a given number of epochs were supplied, the algorithm would not terminate for a non-linearly separable problem. As Plots 8-11 show, the network is somewhat desperately trying to find a solution that separates the 2 classes but doesn't find one. These 4 plots represent the network after 9, 10, 11 and 12 training epochs respectively and are based on sampling data from 2 Gaussian distributions with means μ1 = 0 and μ2 = 2. Plot 12 shows the corresponding error rate, which oscillates heavily as the algorithm tries to fit a decision boundary. In contrast, Plot 13 shows the error rate for a linearly separable problem.

Plots 8-12: Change of the decision boundary in a linearly non-separable problem
Plot 13: Error rate for a non-linearly separable problem
Plot 14: Error rate for a linearly separable problem

Illustrating the Training Process

Plots 15-21 show the learning process for a linearly separable problem with the 3 previously described weight initialisation methods. Plot 15 represents the decision boundary when the weights are initialised with the Minimum Squared Error criterion. Plots 16 & 17 show the network's learning progress when the weights are initialised by setting the decision boundary at x = 2, Plot 18 shows the decision boundary with randomly initialised weights, and Plots 19-21 show the last 3 training epochs (out of 6 in total) of the network's learning progress.

Plot 15: Network decision boundary with Minimum Squared Error criterion weight initialisation
Plot 16: Network decision boundary with x = 2 weight initialisation. Due to an outlier of class 2 at ~(1.8, 0.8), the initialised decision boundary is not yet a solution.
Plot 17: Network decision boundary with x = 2 weight initialisation. The network was able to learn a correct decision boundary after only 1 training epoch.
Plot 18: Network decision boundary with randomly initialised weights, before the first training epoch
Plot 19: Network decision boundary with randomly initialised weights, after training epoch 4 of 6
Plot 20: Network decision boundary with randomly initialised weights, after training epoch 5 of 6
Plot 21: Network decision boundary with randomly initialised weights, after training epoch 6 of 6. Decision boundary successfully learnt.

Part B (1)

Setup of training process

Parameter            Setting
η                    0.0001
bias                 3 + mean(ν)
Error function       Least Mean Squared Error
Weight update rule   Sequential Gradient Descent
Initial weights      w^T = (w0, w1) = (1, 0.4)

Table 5: Basic initialisation of the free parameters

The task for the network is to find the best-fit line for the given data points. For a regression scenario like the given one, the goal is to predict a target variable y given inputs from 1…n variables. In essence, for regression, there is no need for an activation function, as the activation produced by the network already represents the quantity of interest, although strictly speaking one could argue that the network uses the identity function as its activation function.

Network Initialisation

The quantity ν is drawn from a uniform distribution in the interval [-10, +10] and is added to the intercept term of the function y = 0.4x + 3 + ν; hence, given an infinite amount of data points for the function, I would expect the mean of ν to converge to 0. As there is only 1 input parameter, the regression line will be a straight line of the form y = kx + d, with the gradient being close to the gradient of the original function, so 0.4. I chose to initialise the weight vector as w^T = (w0, w1) = (1, 0.4) and to use a bias value of 3 + mean(ν). With an initialisation close to the underlying real function I was hoping to reduce the number of training epochs required.

As for classification, I also shuffled the data before each training epoch for the regression task, and found that network convergence took a lot longer: over 100 test runs, a network that shuffled the data before every epoch required 211 epochs on average for convergence, whereas the average convergence without shuffling was 2 epochs. On the other hand, a network that used shuffling produced slightly better lines in terms of the average squared error: again over 100 test runs, the mean average squared error for a network with shuffling was 16.02, whereas for a network without shuffling the error was 17.2.
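With the identity activation, the sequential update for this setup is the LMS (delta) rule. A minimal sketch of one training epoch, assuming the scalar inputs sit in xs, the noisy targets in ts, and b = 3 + mean(ν) as in Table 5 (illustrative names):

    % Sketch: one epoch of Sequential Gradient Descent with the LMS error.
    for n = randperm(numel(xs))          % shuffle before the epoch
        x = [b; xs(n)];                  % augmented input (bias, x)
        y = w' * x;                      % identity activation: y is the estimate
        w = w + eta * (ts(n) - y) * x;   % LMS / delta-rule update
    end
    yhat = [b * ones(numel(xs), 1), xs(:)] * w;
    mse  = mean((ts(:) - yhat).^2);      % epoch error used for the stop test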
I used Sequential Gradient Descent in conjunction with Least Mean Squares as my error function. I chose Sequential Gradient Descent in favour of Batch Gradient Descent because, in my experiments on linear regression, Sequential Gradient Descent generally converged faster and resulted in a smaller error and therefore a better regression line. For 100 test runs, Sequential Gradient Descent converged after 210 epochs on average for the given setup, whereas Batch Gradient Descent converged only after 350 epochs on average. The average of the mean squared error over 100 test runs for Sequential Gradient Descent was 17.7918, whereas for Batch Gradient Descent it was 19.2267. I chose Least Mean Squares as my error function because it is simple to implement and, for the given setup and in conjunction with Gradient Descent, is guaranteed to converge to a global minimum (as long as the other parameters, i.e. the learning rate, are set accordingly).

Convergence

For testing whether the algorithm has converged, I compared the current error to the error of the previous training epoch. If the difference between the current error and the previous error is below a predefined threshold, the algorithm stops; I most commonly used 0.0001 or 0.00001 as the threshold. The second termination criterion is reaching a predefined number of epochs; I most commonly used values between 100 and 500, which is quite low, but was sufficient for the given tasks.

Of the Virtues of Preprocessing

When running the network to find a best-fit line, I found that it makes a huge difference whether or not the data have been properly preprocessed; in the following paragraphs I will therefore frequently compare applying preprocessing to not applying any preprocessing. All my preprocessing consisted of normalising the input and target values and, after having found the best-fit line, converting the data back to its original space (see Formulas 1 & 2).

Formula 1: Data normalisation
Formula 2: Postprocessing, converting the data back to its original space

On Bias, Weights and Training Error

For the training runs where I didn't preprocess the data, finding a solution was hugely dependent on the value of the learning rate, which needed to be very small (η = 0.00001) in order for the network to converge and produce a good-fit line; however, the training error didn't decrease steadily as I would have expected, but oscillated heavily (see Plot 22). The weights for the network converged towards w1 ≈ 0.4 and w0 ≈ 1; the actual results for one test run were w1 = 0.4003 and w0 = 1.1599 respectively.

For the training runs where I normalised the data, the network was less dependent on specific values of η (I usually had η in the interval [0.0001, 0.001]). With the data normalised, the training error was now decreasing towards 0, which is illustrated in Plot 23. The weights for a network with normalised data converged towards w1 ≈ 1 and w0 ≈ 0 (the exact figures were w1 = 0.9139 and w0 = 0.0009), which makes sense as the normalised function has a gradient of k = 1 and an intercept of 0. Hence, I would conclude that the weights for a regression task converge to the gradient and the intercept of the given function. Formula 2 from above had to be applied to use the weights learnt from a normalised model together with the original data.

Plot 22: Error rate for raw input data (no normalisation or other preprocessing carried out)
Plot 23: Error rate for normalised input data
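Formulas 1 and 2 are figures in the original and their exact form is not recoverable here; a standard choice consistent with the description (normalise inputs and targets, fit, then map the line back to the original space) is z-scoring, sketched below with illustrative names:

    % One plausible reading of Formulas 1 & 2 (assumed, not recovered).
    mx = mean(xs);  sx = std(xs);        % input statistics
    mt = mean(ts);  st = std(ts);        % target statistics
    xn = (xs - mx) / sx;                 % Formula 1: normalised inputs ...
    tn = (ts - mt) / st;                 % ... and normalised targets

    % ... train on (xn, tn) to obtain a line yn = k*xn + d, then:

    % Formula 2: back to the original space,
    % y = st*(k*(x - mx)/sx + d) + mt
    k_orig = st * k / sx;
    d_orig = mt + st * d - st * k * mx / sx;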
Turning up the Noise

Increasing the noise results in data points where it is harder to recognise a straight line as the underlying function. By increasing the random fluctuations, the resulting regression line becomes more horizontal, which means that the general trend of the data, represented by the gradient of the (underlying) function, can no longer be reliably estimated. For example, the learnt weight for the gradient with ν = [-50, +50] is no longer close to 1, but only 0.61, resulting in a gradient of ~0.25 for the regression line (see Plots 24 & 25).

Plot 24: ν = [-10, +10]; the network is still able to capture the general trend of the data well, with the learnt weights converging towards w1 ≈ 0.4 and w0 ≈ 1.
Plot 25: ν = [-50, +50]; the random fluctuations significantly distort the underlying function, resulting in the regression line being more horizontal and ending up with a gradient (~0.25) quite different from the gradient of the original function (0.4).

Modifying the Underlying Function

I changed the function to y = 1.2x - 2 + ν and initialised the weights as w0 = 1, with a bias value of -2 + mean(ν), and w1 = 1.2. The resulting weights of the network again converged close to the gradient of the underlying function (w1 ≈ 1.2, the exact figure being w1 = 1.1707) and the intercept (w0 ≈ 1, the exact figure being w0 = 0.8448). For a network trained on normalised data, the weights converged towards w1 ≈ 1 and w0 ≈ 0 respectively (the exact figures being w1 = 0.9254 and w0 = 0.0002).

A Note on the Closed-Form Regression Line

For the given problem it would be possible to calculate the best-fit regression line in closed form instead of using an iterative process. As would be expected, the resulting closed-form regression line was always a better fit, in terms of minimising the least mean squared error, than the iterative approaches. However, a little surprisingly, over 1000 test runs with η = 0.00001 for the iterative process, the difference in the means of the average squared errors was quite small: the mean of the average squared errors for the closed-form approach was 16.3431, and for the iterative approach 16.3466. Out of interest, I increased the value of η to 0.001 and observed the mean of the average squared errors over 1000 test runs again, resulting in a closed-form error of 16.3606 and an iterative error of 16.4676.
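The closed-form line referred to above is the ordinary least-squares solution, which for one input follows directly from the normal equations (sketch, illustrative names):

    % Sketch: closed-form least-squares line through (xs, ts).
    Xa   = [ones(numel(xs), 1), xs(:)];  % design matrix with intercept column
    w_cf = Xa \ ts(:);                   % normal-equations solution (d; k)
    ase  = mean((ts(:) - Xa * w_cf).^2); % average squared error of the fit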
Illustration of What the Network Is Doing

Plot 26 shows the data points, the underlying original function, the closed-form regression line and the regression line retrieved from the network. Plot 27 shows the data points just with the regression line retrieved from the network. To further illustrate the learning process, Plots 28-32 show the training progress when the network weights have been randomly selected (to better illustrate the learning progress), and Plot 33 shows the corresponding error rate. Note that the algorithm converged after 11 epochs and the plots show the line-fitting progress after epochs 0-2, 6 and 11 respectively; the plots in between have been omitted for space reasons.

Plot 26: All-in-one plot containing the original function, the regression line learnt by the network and the regression line obtained in closed form
Plot 27: Plot containing just the regression line learnt by the network
Plot 28: State of the network before the start of learning
Plot 29: Regression line after the first training epoch
Plot 30: Regression line after the 2nd training epoch
Plot 31: Regression line after the 6th training epoch
Plot 32: Regression line with converged network parameters, after 11 training epochs
Plot 33: Error rate of the network

Part B (2)

Introductory Notes

For this task the dimensionality of the input space is increased to 2; the inputs are independent of each other. Performing regression for this function results in a regression plane. After having appreciated the benefits of preprocessing in the previous part, I only experimented with normalised data for this section. Normalisation was performed the same way as in Part B (1) (see Formulas 1 & 2). For weight initialisation I followed my previous approach of initialising the weight vector to be 1 for the bias weight and the gradients of the respective terms of the function otherwise; so, for the given function y = 0.4x1 + 1.4x2 - 2 + ν, the initial weight vector was w^T = (w0, w1, w2) = (1, 0.4, 1.4). I also initialised the bias to -2 + mean(ν), as previously. I further used the same error function and weight update rule, and also performed shuffling before each training epoch. Table 6 summarises the basic setup for this task. I again used Sequential Gradient Descent in conjunction with Least Mean Squares as error function for learning the network parameters, for the same reasons as stated in the previous section.

Setup of training process

Parameter            Setting
η                    0.0001
bias                 -2 + mean(ν)
Error function       Least Mean Squared Error
Weight update rule   Sequential Gradient Descent
Initial weights      w^T = (w0, w1, w2) = (1, 0.4, 1.4)

Table 6: Basic initialisation of the free parameters

On the Weight and Bias Values

As I expected, w1 and w2 converged towards 0.4 and 1.4 respectively; however, this time the bias weight was a lot more volatile, converging towards 1 in some experiments and towards 0 in others. This was a bit surprising at first, but when I started printing the mean value of the random fluctuations, mean(ν), alongside the weights, I found that the closer mean(ν) was to 0, the closer w0 was to 1 (e.g. mean(ν) = -0.159, w0 = 0.9990), and the further mean(ν) was away from 0, the closer w0 converged towards 0 (e.g. mean(ν) = 10.7259, w0 = 0.0019). In both cases the resulting value for the intercept would be in the interval [-1, +1].

The variance in the bias led me to run some more experiments, and I found that w1 and w2 actually don't converge towards 0.4 and 1.4 at all! The key was varying the value of η, where the learnt weights changed quite significantly. At η = 0.01 the values were as reported above; when decreasing the value to η = 0.00001, w1 converged towards ~0.2, so half the gradient value, and w2 converged towards ~0.7, also roughly half the gradient value. This behaviour seems to be confirmed by the values obtained from a closed-form solution. Also interestingly, the value for w0 always converged towards 0 in the closed-form approach. Table 7 summarises the findings of the previous 2 paragraphs.

Experiment  Weight  Closed form  Network (η = 0.01)  Network (η = 0.00001)  mean(ν)
1           w0      0            0.9958              0.1091                 -0.4706
            w1      0.2811       0.3977              0.2811                 -0.4706
            w2      0.7357       1.387               0.7376                 -0.4706
2           w0      0            0.0052              -0.0016                7.4016
            w1      0.2838       0.3906              0.2976                 7.4016
            w2      0.7613       1.3406              0.8366                 7.4016

Table 7: Interpretation of the learnt weights
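The closed-form column in Table 7 can be obtained with the same least-squares construction as before, now with two input columns; on normalised data the fitted intercept is 0 by construction, which would explain the zeros in the closed-form w0 column (sketch, illustrative names; x1s and x2s hold the two inputs):

    % Sketch: closed-form weights for the two-input regression plane.
    Xa   = [ones(numel(ts), 1), x1s(:), x2s(:)];  % (bias, x1, x2) columns
    w_cf = Xa \ ts(:);                            % (w0; w1; w2) in closed form
    yfit = Xa * w_cf;                             % fitted regression plane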
Modifying the Function

By changing the function to y = -1.2x1 + 0.6x2 + ν, I could validate my assumption that the weights converge roughly towards gradient/2, and definitely not towards the gradient value, as Table 8 below shows.

Experiment  Weight  Closed form  Network (η = 0.01)  Network (η = 0.00001)  mean(ν)
1           w0      0            0.999               0.6008                 -0.2257
            w1      -0.6993      -1.1902             -0.6994                -0.2257
            w2      0.4315       0.5967              0.4315                 -0.2257
2           w0      0            0.0026              0.001                  -11.1796
            w1      -0.7145      -1.1781             -0.7752                -11.1796
            w2      0.2505       0.5794              0.2948                 -11.1796

What the Network Is Doing

Plots 34-37 show that the network is trying to find the best-fit plane for the given data.

Plot 34: Displaying the best-fit regression plane, learnt by the network, for the given data points
Plot 35: Displaying the regression plane obtained in closed form
Plot 36: Displaying the regression plane obtained from the network as well as the regression plane obtained in closed form
Plot 37: Error rate of the network

Appendix

The Matlab code for this assignment is contained on the following pages.