
CHAPTER 4

A visual proof that neural nets can compute any function

By Michael Nielsen / Jan 2016

One of the most striking facts about neural networks is that they can compute any function at all. That is, suppose someone hands you some complicated, wiggly function, f(x):


No matter what the function, there is guaranteed to be a neural network so that for every possible input, x, the value f(x) (or some close approximation) is output from the network, e.g.:


This result holds even if the function has many inputs, f = f(x1, …, xm), and many outputs. For instance, here's a network computing a function with m = 3 inputs and n = 2 outputs:




This result tells us that neural networks have a kind of universality. No matter what function we want to compute, we know that there is a neural network which can do the job.

What's more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons - a so-called single hidden layer. So even very simple network architectures can be extremely powerful.
The universality theorem is well known by people who use neural networks. But why it's true is not so widely understood. Most of the explanations available are quite technical. For instance, one of the original papers proving the result* did so using the Hahn-Banach theorem, the Riesz Representation theorem, and some Fourier analysis. If you're a mathematician the argument is not difficult to follow, but it's not so easy for most people. That's a pity, since the underlying reasons for universality are simple and beautiful.

*Approximation by superpositions of a sigmoidal function, by George Cybenko (1989). The result was very much in the air at the time, and several groups proved closely related results. Cybenko's paper contains a useful discussion of much of that work. Another important early paper is Multilayer feedforward networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989). This paper uses the Stone-Weierstrass theorem to arrive at similar results.

In this chapter I give a simple and mostly visual explanation of the universality theorem. We'll go step by step through the underlying ideas. You'll understand why it's true that neural networks can compute any function. You'll understand some of the limitations of the result. And you'll understand how the result relates to deep neural networks.
To follow the material in the chapter, you do not need to have read earlier chapters in this book. Instead, the chapter is structured to be enjoyable as a self-contained essay. Provided you have just a little basic familiarity with neural networks, you should be able to follow the explanation. I will, however, provide occasional links to earlier material, to help fill in any gaps in your knowledge.
Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it's worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Again, that can be thought of as computing a function*. Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation*. Universality means that, in principle, neural networks can do all these things and many more.

*Actually, computing one of many functions, since there are often many acceptable translations of a given piece of text.

*Ditto the remark about translation and there being many possible functions.
Of course, just because we know a neural network exists that can (say) translate Chinese text into English, that doesn't mean we have good techniques for constructing or even recognizing such a network. This limitation applies also to traditional universality theorems for models such as Boolean circuits. But, as we've seen earlier in the book, neural networks have powerful algorithms for learning functions. That combination of learning algorithms + universality is an attractive mix. Up to now, the book has focused on the learning algorithms. In this chapter, we focus on universality, and what it means.

Two caveats

Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement "a neural network can compute any function".
First, this doesn't mean that a network can be used to exactly compute any function. Rather, we can get an approximation that is as good as we want. By increasing the number of hidden neurons we can improve the approximation. For instance, earlier I illustrated a network computing some function f(x) using three hidden neurons. For most functions only a low-quality approximation will be possible using three hidden neurons. By increasing the number of hidden neurons (say, to five) we can typically get a better approximation:

And we can do still better by further increasing the number of hidden neurons.
To make this statement more precise, suppose we're given a function f(x) which we'd like to compute to within some desired accuracy ε > 0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies |g(x) - f(x)| < ε, for all inputs x. In other words, the approximation will be good to within the desired accuracy for every possible input.
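In code, you can think of the guarantee as a simple check - a sketch of my own, assuming the goal function f and the network's output function g are both available as Python functions on the interval [0, 1]:

    import numpy as np

    def within_accuracy(f, g, eps, xs=np.linspace(0.0, 1.0, 1001)):
        # True if the network output g stays within eps of the goal f on a fine grid.
        return np.max(np.abs(g(xs) - f(xs))) < eps

The universality theorem says that for any continuous f and any eps, there is some network whose output g passes this check.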
The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we'd really like to compute is discontinuous, it's often the case that a continuous approximation is good enough. If that's so, then we can use a neural network. In practice, this is not usually an important limitation.
Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision. In this chapter we'll actually prove a slightly weaker version of this result, using two hidden layers instead of one. In the problems I'll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.

Universality with one input and one output

To understand why the universality theorem is true, let's start by understanding how to construct a neural network which approximates a function with just one input and one output:

It turns out that this is the core of the problem of universality. Once we've understood this special case it's actually pretty easy to extend to functions with many inputs and many outputs.

To build insight into how to construct a network to compute f, let's start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron:

To get a feel for how components in the network work, let's focus on the top hidden neuron. In the diagram below, click on the weight, w, and drag the mouse a little ways to the right to increase w. You can immediately see how the function computed by the top hidden neuron changes:

As we learnt earlier in the book, what's being computed by the hidden neuron is σ(wx + b), where σ(z) ≡ 1/(1 + e^(-z)) is the sigmoid function. Up to now, we've made frequent use of this algebraic form. But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph. This won't just give us a better feel for what's going on, it will also give us a proof* of universality that applies to activation functions other than the sigmoid function.

*Strictly speaking, the visual approach I'm taking isn't what's traditionally thought of as a proof. But I believe the visual approach gives more insight into why the result is true than a traditional proof. And, of course, that kind of insight is the real purpose behind a proof. Occasionally, there will be small gaps in the reasoning I present: places where I make a visual argument that is plausible, but not quite rigorous. If this bothers you, then consider it a challenge to fill in the missing steps. But don't lose sight of the real purpose: to understand why the universality theorem is true.

To get started on this proof, try clicking on the bias, b, in the diagram above, and dragging to the right to increase it. You'll see that as the bias increases the graph moves to the left, but its shape doesn't change.

Next, click and drag to the left in order to decrease the bias. You'll see that as the bias decreases the graph moves to the right, but, again, its shape doesn't change.

Next, decrease the weight to around 2 or 3. You'll see that as you decrease the weight, the curve broadens out. You might need to change the bias as well, in order to keep the curve in frame.

Finally, increase the weight up past w = 100. As you do, the curve gets steeper, until eventually it begins to look like a step function. Try to adjust the bias so the step occurs near x = 0.3. The following short clip shows what your result should look like. Click on the play button to play (or replay) the video:


We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation. Below I've plotted the output from the top hidden neuron when the weight is w = 999. Note that this plot is static, and you can't change parameters such as the weight.

It's actually quite a bit easier to work with step functions than general sigmoid functions. The reason is that in the output layer we add up contributions from all the hidden neurons. It's easy to analyze the sum of a bunch of step functions, but rather more difficult to reason about what happens when you add up a bunch of sigmoid-shaped curves. And so it makes things much easier to assume that our hidden neurons are outputting step functions. More concretely, we do this by fixing the weight w to be some very large value, and then setting the position of the step by modifying the bias. Of course, treating the output as a step function is an approximation, but it's a very good approximation, and for now we'll treat it as exact. I'll come back later to discuss the impact of deviations from this approximation.
At what value of x does the step occur? Put another way, how does the position of the step depend upon the weight and bias?

To answer this question, try modifying the weight and bias in the diagram above (you may need to scroll back a bit). Can you figure out how the position of the step depends on w and b? With a little work you should be able to convince yourself that the position of the step is proportional to b, and inversely proportional to w.

In fact, the step is at position s = -b/w, as you can see by modifying the weight and bias in the following diagram:

It will greatly simplify our lives to describe hidden neurons using just a single parameter, s, which is the step position, s = -b/w. Try modifying s in the following diagram, in order to get used to the new parameterization:

As noted above, we've implicitly set the weight w on the input to be some large value - big enough that the step function is a very good approximation. We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias b = -ws.
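As a concrete sketch of this parameterization (my own code, not part of the book's interactive diagrams), here's a step-position neuron with a large fixed weight:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step_neuron(x, s, w=1000.0):
        # Hidden neuron described by its step position s: with a large weight w
        # and bias b = -w*s, the output sigmoid(w*x + b) is close to 0 for x < s
        # and close to 1 for x > s.
        b = -w * s
        return sigmoid(w * x + b)

    print(step_neuron(0.39, 0.40))  # just below the step: close to 0
    print(step_neuron(0.41, 0.40))  # just above the step: close to 1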

Up to now we've been focusing on the output from just the top hidden neuron. Let's take a look at the behavior of the entire network. In particular, we'll suppose the hidden neurons are computing step functions parameterized by step points s1 (top neuron) and s2 (bottom neuron). And they'll have respective output weights w1 and w2. Here's the network:

What's being plotted on the right is the weighted output w1 a1 + w2 a2 from the hidden layer. Here, a1 and a2 are the outputs from the top and bottom hidden neurons, respectively*. These outputs are denoted with a's because they're often known as the neurons' activations.

*Note, by the way, that the output from the whole network is σ(w1 a1 + w2 a2 + b), where b is the bias on the output neuron. Obviously, this isn't the same as the weighted output from the hidden layer, which is what we're plotting here. We're going to focus on the weighted output from the hidden layer right now, and only later will we think about how that relates to the output from the whole network.

Try increasing and decreasing the step point s1 of the top hidden neuron. Get a feel for how this changes the weighted output from the hidden layer. It's particularly worth understanding what happens when s1 goes past s2. You'll see that the graph changes shape when this happens, since we have moved from a situation where the top hidden neuron is the first to be activated to a situation where the bottom hidden neuron is the first to be activated.

Similarly, try manipulating the step point s2 of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.

Try increasing and decreasing each of the output weights. Notice how this rescales the contribution from the respective hidden neurons. What happens when one of the weights is zero?

Finally, try setting w1 to be 0.8 and w2 to be -0.8. You get a "bump" function, which starts at point s1, ends at point s2, and has height 0.8. For instance, the weighted output might look like this:


Of course, we can rescale the bump to have any height at all. Let's use a single parameter, h, to denote the height. To reduce clutter I'll also remove the "s" and "w" notations.

Try changing the value of h up and down, to see how the height of the bump changes. Try changing the height so it's negative, and observe what happens. And try changing the step points to see how that changes the shape of the bump.
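Here's a small sketch of the bump construction in code (my own illustration, with arbitrary step points 0.4 and 0.6), built from a pair of step neurons with output weights +h and -h:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step_neuron(x, s, w=1000.0):
        # Approximate step function at position s (bias b = -w*s).
        return sigmoid(w * (x - s))

    def bump(x, s1, s2, h):
        # Weighted output of a pair of hidden neurons: +h times the step at s1,
        # -h times the step at s2.  Roughly h for s1 < x < s2, roughly 0 elsewhere.
        return h * step_neuron(x, s1) - h * step_neuron(x, s2)

    print(round(bump(0.5, 0.4, 0.6, 0.8), 3))  # inside the bump: about 0.8
    print(round(bump(0.9, 0.4, 0.6, 0.8), 3))  # outside the bump: about 0.0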
You'll notice, by the way, that we're using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of if-then-else statement, e.g.:

    if input >= step point:
        add 1 to the weighted output
    else:
        add 0 to the weighted output

For the most part I'm going to stick with the graphical point of view. But in what follows you may sometimes find it helpful to switch points of view, and think about things in terms of if-then-else.
We can use our bump-making trick to get two bumps, by gluing two pairs of hidden neurons together into the same network:

I've suppressed the weights here, simply writing the h values for each pair of hidden neurons. Try increasing and decreasing both h values, and observe how it changes the graph. Move the bumps around by changing the step points.
More generally, we can use this idea to get as many peaks as we want, of any height. In particular, we can divide the interval [0, 1] up into a large number, N, of subintervals, and use N pairs of hidden neurons to set up peaks of any desired height. Let's see how this works for N = 5. That's quite a few neurons, so I'm going to pack things in a bit. Apologies for the complexity of the diagram: I could hide the complexity by abstracting away further, but I think it's worth putting up with a little complexity, for the sake of getting a more concrete feel for how these networks work.

You can see that there are five pairs of hidden neurons. The step points for the respective pairs of neurons are 0, 1/5, then 1/5, 2/5, and so on, out to 4/5, 5/5. These values are fixed - they make it so we get five evenly spaced bumps on the graph.
Each pair of neurons has a value of h associated to it. Remember, the connections output from the neurons have weights h and -h (not marked). Click on one of the h values, and drag the mouse to the right or left to change the value. As you do so, watch the function change. By changing the output weights we're actually designing the function!

Contrariwise, try clicking on the graph, and dragging up or down to change the height of any of the bump functions. As you change the heights, you can see the corresponding change in h values. And, although it's not shown, there is also a change in the corresponding output weights, which are +h and -h.
In other words, we can directly manipulate the function appearing in the graph on the right, and see that reflected in the h values on the left. A fun thing to do is to hold the mouse button down and drag the mouse from one side of the graph to the other. As you do this you draw out a function, and get to watch the parameters in the neural network adapt.

Time for a challenge.
Let's think back to the function I plotted at the beginning of the chapter:

I didn't say it at the time, but what I plotted is actually the function

f(x) = 0.2 + 0.4x^2 + 0.3x sin(15x) + 0.05 cos(50x),   (113)

plotted over x from 0 to 1, and with the y axis taking values from 0 to 1.
That's obviously not a trivial function.

You're going to figure out how to compute it using a neural network.
In our networks above we've been analyzing the weighted combination ∑j wj aj output from the hidden neurons. We now know how to get a lot of control over this quantity. But, as I noted earlier, this quantity is not what's output from the network. What's output from the network is σ(∑j wj aj + b) where b is the bias on the output neuron. Is there some way we can achieve control over the actual output from the network?

The solution is to design a neural network whose hidden layer has a weighted output given by σ⁻¹ ∘ f(x), where σ⁻¹ is just the inverse of the σ function. That is, we want the weighted output from the hidden layer to be:


If we can do this, then the output from the network as a whole will be a good approximation to f(x)*.

*Note that I have set the bias on the output neuron to 0.
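In code, the inverse of the sigmoid is the logit function, so the target for the weighted hidden output might be computed like this (a sketch of my own; f here is the goal function assumed above, which takes values strictly between 0 and 1):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_inverse(y):
        # The inverse of the sigmoid: sigma^-1(y) = ln(y / (1 - y)), for 0 < y < 1.
        return np.log(y / (1.0 - y))

    def f(x):
        # The goal function plotted earlier in the chapter.
        return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

    # If the weighted output from the hidden layer approximates sigmoid_inverse(f(x)),
    # then applying the sigmoid at the output neuron (with bias 0) recovers f(x).
    x = 0.3
    print(sigmoid(sigmoid_inverse(f(x))), f(x))  # the two values should agree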

Your challenge, then, is to design a neural network to approximate the goal function shown just above. To learn as much as possible, I want you to solve the problem twice. The first time, please click on the graph, directly adjusting the heights of the different bump functions. You should find it fairly easy to get a good match to the goal function. How well you're doing is measured by the average deviation between the goal function and the function the network is actually computing. Your challenge is to drive the average deviation as low as possible. You complete the challenge when you drive the average deviation to 0.40 or below.

Once you've done that, click on "Reset" to randomly re-initialize the bumps. The second time you solve the problem, resist the urge to click on the graph. Instead, modify the h values on the left-hand side, and again attempt to drive the average deviation to 0.40 or below.


You've now figured out all the elements necessary for the network to approximately compute the function f(x)! It's only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing more bumps.

In particular, it's easy to convert all the data we have found back into the standard parameterization used for neural networks. Let me just recap quickly how that works.

The first layer of weights all have some large, constant value, say w = 1000.

The biases on the hidden neurons are just b = -ws. So, for instance, for the second hidden neuron s = 0.2 becomes b = -1000 × 0.2 = -200.

The final layer of weights are determined by the h values. So, for instance, the value you've chosen above for the first h, h = 1.3, means that the output weights from the top two hidden neurons are 1.3 and -1.3, respectively. And so on, for the entire layer of output weights.

Finally, the bias on the output neuron is 0.
That's everything: we now have a complete description of a neural network which does a pretty good job computing our original goal function. And we understand how to improve the quality of the approximation by increasing the number of hidden neurons.
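As an illustration, here's that recipe as a sketch in code. The five bump heights below are placeholders of my own choosing, standing in for whatever h values you found above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    N = 5                                               # five bumps on [0, 1]
    step_points = [j / N for j in range(N + 1)]         # 0, 1/5, 2/5, ..., 1
    heights = [1.3, -0.6, 0.2, 0.9, -1.1]               # one h per pair of hidden neurons

    w = 1000.0                                          # large first-layer weight
    hidden_biases = [-w * s for left, right in zip(step_points[:-1], step_points[1:])
                     for s in (left, right)]
    output_weights = [sign * h for h in heights for sign in (+1.0, -1.0)]
    output_bias = 0.0

    def network(x):
        activations = [sigmoid(w * x + b) for b in hidden_biases]
        weighted = sum(wj * a for wj, a in zip(output_weights, activations))
        return sigmoid(weighted + output_bias)

    print(round(network(0.1), 3))   # inside the first bump: close to sigmoid(1.3)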
What's more, there was nothing special about our original goal function, f(x) = 0.2 + 0.4x^2 + 0.3x sin(15x) + 0.05 cos(50x). We could have used this procedure for any continuous function from [0, 1] to [0, 1]. In essence, we're using our single-layer neural networks to build a lookup table for the function. And we'll be able to build on this idea to provide a general proof of universality.

Many input variables

Let's extend our results to the case of many input variables. This sounds complicated, but all the ideas we need can be understood in the case of just two inputs. So let's address the two-input case.

We'll start by considering what happens when we have two inputs to a neuron:

Here, we have inputs x and y, with corresponding weights w1 and w2, and a bias b on the neuron. Let's set the weight w2 to 0, and then play around with the first weight, w1, and the bias, b, to see how they affect the output from the neuron:
Output

As you can see, with w2 = 0 the input y makes no difference to the output from the neuron. It's as though x is the only input.

Given this, what do you think happens when we increase the weight w1 to w1 = 100, with w2 remaining 0? If you don't immediately see the answer, ponder the question for a bit, and see if you can figure out what happens. Then try it out and see if you're right. I've shown what happens in the following movie:

Just as in our earlier discussion, as the input weight gets larger the output approaches a step function. The difference is that now the step function is in three dimensions. Also as before, we can move the location of the step point around by modifying the bias. The actual location of the step point is sx ≡ -b/w1.
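A quick sketch of that two-input neuron in code (my own example, with w1 = 1000 and b = -400, so the step sits at sx = 0.4):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def two_input_neuron(x, y, w1=1000.0, w2=0.0, b=-400.0):
        # With w2 = 0 and w1 large, the output is a step in the x direction,
        # located at sx = -b/w1 (here 400/1000 = 0.4); y has no effect.
        return sigmoid(w1 * x + w2 * y + b)

    print(round(two_input_neuron(0.39, 0.9), 3))  # x below the step: close to 0
    print(round(two_input_neuron(0.41, 0.1), 3))  # x above the step: close to 1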

Let's redo the above using the position of the step as the parameter:

Output

Here, we assume the weight on the x input has some large value - I've used w1 = 1000 - and the weight w2 = 0. The number on the neuron is the step point, and the little x above the number reminds us that the step is in the x direction. Of course, it's also possible to get a step function in the y direction, by making the weight on the y input very large (say, w2 = 1000), and the weight on the x equal to 0, i.e., w1 = 0:
Output

The number on the neuron is again the step point, and in this case the little y above the number reminds us that the step is in the y direction. I could have explicitly marked the weights on the x and y inputs, but decided not to, since it would make the diagram rather cluttered. But do keep in mind that the little y marker implicitly tells us that the y weight is large, and the x weight is 0.

We can use the step functions we've just constructed to compute a three-dimensional bump function. To do this, we use two neurons, each computing a step function in the x direction. Then we combine those step functions with weight h and -h, respectively, where h is the desired height of the bump. It's all illustrated in the following diagram:
Weighted output from hidden layer

Try changing the value of the height, h. Observe how it relates to the weights in the network. And see how it changes the height of the bump function on the right.

Also, try changing the step point 0.30 associated to the top hidden neuron. Witness how it changes the shape of the bump. What happens when you move it past the step point 0.70 associated to the bottom hidden neuron?
We've figured out how to make a bump function in the x direction. Of course, we can easily make a bump function in the y direction, by using two step functions in the y direction. Recall that we do this by making the weight large on the y input, and the weight 0 on the x input. Here's the result:

Weighted output from hidden layer

This looks nearly identical to the earlier network! The only thing explicitly shown as changing is that there are now little y markers on our hidden neurons. That reminds us that they're producing y step functions, not x step functions, and so the weight is very large on the y input, and zero on the x input, not vice versa. As before, I decided not to show this explicitly, in order to avoid clutter.

Let's consider what happens when we add up two bump functions, one in the x direction, the other in the y direction, both of height h:
Weighted output from hidden layer

To simplify the diagram I've dropped the connections with zero weight. For now, I've left in the little x and y markers on the hidden neurons, to remind you in what directions the bump functions are being computed. We'll drop even those markers later, since they're implied by the input variable.

Try varying the parameter h. As you can see, this causes the output weights to change, and also the heights of both the x and y bump functions.
What we've built looks a little like a tower function:

Tower function

If we could build such tower functions, then we could use them to approximate arbitrary functions, just by adding up many towers of different heights, and in different locations:

Many towers

Of course, we haven't yet figured out how to build a tower function. What we have constructed looks like a central tower, of height 2h, with a surrounding plateau, of height h.
But we can make a tower function. Remember that earlier we saw neurons can be used to implement a type of if-then-else statement:

    if input >= threshold:
        output 1
    else:
        output 0

That was for a neuron with just a single input. What we want is to apply a similar idea to the combined output from the hidden neurons:

    if combined output from hidden neurons >= threshold:
        output 1
    else:
        output 0

If we choose the threshold appropriately - say, a value of 3h/2, which is sandwiched between the height of the plateau and the height of the central tower - we could squash the plateau down to zero, and leave just the tower standing.

Can you see how to do this? Try experimenting with the following network to figure it out. Note that we're now plotting the output from the entire network, not just the weighted output from the hidden layer. This means we add a bias term to the weighted output from the hidden layer, and apply the σ function. Can you find values for h and b which produce a tower? This is a bit tricky, so if you think about this for a while and remain stuck, here's two hints: (1) To get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all h or -h) to be large; and (2) the value of b determines the scale of the if-then-else threshold.
Output

With our initial parameters, the output looks like a flattened version of the earlier diagram, with its tower and plateau. To get the desired behaviour, we increase the parameter h until it becomes large. That gives the if-then-else thresholding behaviour. Second, to get the threshold right, we'll choose b ≈ -3h/2. Try it, and see how it works!
Here's what it looks like, when we use h = 10:

Even for this relatively modest value of h, we get a pretty good tower function. And, of course, we can make it as good as we want by increasing h still further, and keeping the bias as b = -3h/2.
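Here's a small sketch of that tower construction in code (my own illustration, with step points 0.4 and 0.6 chosen arbitrarily):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step(u, s, big_w=1000.0):
        # Approximate step function in one variable, rising at u = s.
        return sigmoid(big_w * (u - s))

    def tower(x, y, h=10.0):
        # An x-direction bump plus a y-direction bump, each of height h,
        # thresholded by the output neuron with bias b = -3h/2.
        weighted = h * (step(x, 0.4) - step(x, 0.6)) + h * (step(y, 0.4) - step(y, 0.6))
        return sigmoid(weighted - 1.5 * h)

    print(round(tower(0.5, 0.5), 3))  # inside the tower: close to 1
    print(round(tower(0.5, 0.9), 3))  # on the plateau arm: close to 0
    print(round(tower(0.9, 0.9), 3))  # outside: close to 0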
Let's try gluing two such networks together, in order to compute two different tower functions. To make the respective roles of the two sub-networks clear I've put them in separate boxes, below: each box computes a tower function, using the technique described above. The graph on the right shows the weighted output from the second hidden layer, that is, it's a weighted combination of tower functions.

Weighted output

In particular, you can see that by modifying the weights in the final layer you can change the height of the output towers.

The same idea can be used to compute as many towers as we like. We can also make them as thin as we like, and whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function of two variables:
Many towers

In particular, by making the weighted output from the second hidden layer a good approximation to σ⁻¹ ∘ f, we ensure the output from our network will be a good approximation to any desired function, f.
What about functions of more than two variables?

Let's try three variables x1, x2, x3. The following network can be used to compute a tower function in four dimensions:

Here, the x1, x2, x3 denote inputs to the network. The s1, t1 and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points s1, t1, s2, … The weights in the second layer alternate +h, -h, where h is some very large number. And the output bias is -5h/2.
This network computes a function which is 1 provided three conditions are met: x1 is between s1 and t1; x2 is between s2 and t2; and x3 is between s3 and t3. The network is 0 everywhere else. That is, it's a kind of tower which is 1 in a little region of input space, and 0 everywhere else.
By gluing together many such networks we can get as many towers as we want, and so approximate an arbitrary function of three variables. Exactly the same idea works in m dimensions. The only change needed is to make the output bias (-m + 1/2)h, in order to get the right kind of sandwiching behavior to level the plateau.
Okay, so we now know how to use neural networks to approximate a real-valued function of many variables. What about vector-valued functions f(x1, …, xm) ∈ R^n? Of course, such a function can be regarded as just n separate real-valued functions, f1(x1, …, xm), f2(x1, …, xm), and so on. So we create a network approximating f1, another network for f2, and so on. And then we simply glue all the networks together. So that's also easy to cope with.

Problem

We've seen how to use networks with two hidden layers to approximate an arbitrary function. Can you find a proof showing that it's possible with just a single hidden layer? As a hint, try working in the case of just two input variables, and showing that: (a) it's possible to get step functions not just in the x or y directions, but in an arbitrary direction; (b) by adding up many of the constructions from part (a) it's possible to approximate a tower function which is circular in shape, rather than rectangular; (c) using these circular towers, it's possible to approximate an arbitrary function. To do part (c) it may help to use ideas from a bit later in this chapter.

Extension beyond sigmoid neurons

We've proved that networks made up of sigmoid neurons can compute any function. Recall that in a sigmoid neuron the inputs x1, x2, … result in the output σ(∑j wj xj + b), where wj are the weights, b is the bias, and σ is the sigmoid function:

What if we consider a different type of neuron, one using some other activation function, s(z):

That is, we'll assume that if our neuron has inputs x1, x2, …, weights w1, w2, … and bias b, then the output is s(∑j wj xj + b).

We can use this activation function to get a step function, just as we did with the sigmoid. Try ramping up the weight in the following, say to w = 100:

Just as with the sigmoid, this causes the activation function to contract, and ultimately it becomes a very good approximation to a step function. Try changing the bias, and you'll see that we can set the position of the step to be wherever we choose. And so we can use all the same tricks as before to compute any desired function.
What properties does s(z) need to satisfy in order for this to work? We do need to assume that s(z) is well-defined as z → -∞ and z → ∞. These two limits are the two values taken on by our step function. We also need to assume that these limits are different from one another. If they weren't, there'd be no step, simply a flat graph! But provided the activation function s(z) satisfies these properties, neurons based on such an activation function are universal for computation.
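As a quick illustration (my own, using tanh as the alternative activation s(z), whose limits as z → -∞ and z → ∞ are -1 and +1 and so are different):

    import numpy as np

    def step_from_activation(x, s, activation=np.tanh, w=1000.0):
        # With any activation whose limits at -infinity and +infinity exist and differ,
        # a large weight w and bias b = -w*s give an approximate step at x = s.
        return activation(w * x - w * s)

    print(round(step_from_activation(0.29, 0.3), 3))  # below the step: close to -1
    print(round(step_from_activation(0.31, 0.3), 3))  # above the step: close to +1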

Problems

Earlier in the book we met another type of neuron known as a rectified linear unit. Explain why such neurons don't satisfy the conditions just given for universality. Find a proof of universality showing that rectified linear units are universal for computation.

Suppose we consider linear neurons, i.e., neurons with the activation function s(z) = z. Explain why linear neurons don't satisfy the conditions just given for universality. Show that such neurons can't be used to do universal computation.

Fixing up the step functions

Up to now, we've been assuming that our neurons can produce step functions exactly. That's a pretty good approximation, but it is only an approximation. In fact, there will be a narrow window of failure, illustrated in the following graph, in which the function behaves very differently from a step function:

In these windows of failure the explanation I've given for universality will fail.

Now, it's not a terrible failure. By making the weights input to the neurons big enough we can make these windows of failure as small as we like. Certainly, we can make the window much narrower than I've shown above - narrower, indeed, than our eye could see. So perhaps we might not worry too much about this problem. Nonetheless, it'd be nice to have some way of addressing the problem.

In fact, the problem turns out to be easy to fix. Let's look at the fix for neural networks computing functions with just one input and one output. The same ideas work also to address the problem when there are more inputs and outputs.
In particular, suppose we want our network to compute some function, f. As before, we do this by trying to design our network so that the weighted output from our hidden layer of neurons is σ⁻¹ ∘ f(x).

If we were to do this using the technique described earlier, we'd use the hidden neurons to produce a sequence of bump functions:

Again, I've exaggerated the size of the windows of failure, in order to make them easier to see. It should be pretty clear that if we add all these bump functions up we'll end up with a reasonable approximation to σ⁻¹ ∘ f(x), except within the windows of failure.

Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to σ⁻¹ ∘ f(x)/2. Of course, this looks just like a scaled-down version of the last graph:

And suppose we use another set of hidden neurons to compute an approximation to σ⁻¹ ∘ f(x)/2, but with the bases of the bumps shifted by half the width of a bump:
Now we have two different approximations to σ⁻¹ ∘ f(x)/2. If we add up the two approximations we'll get an overall approximation to σ⁻¹ ∘ f(x). That overall approximation will still have failures in small windows. But the problem will be much less than before. The reason is that points in a failure window for one approximation won't be in a failure window for the other. And so the approximation will be a factor of roughly 2 better in those windows.
We could do even better by adding up a large number, M, of overlapping approximations to the function σ⁻¹ ∘ f(x)/M. Provided the windows of failure are narrow enough, a point will only ever be in one window of failure. And provided we're using a large enough number M of overlapping approximations, the result will be an excellent overall approximation.
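Here's a rough sketch of the idea in code (my own, with a deliberately modest weight so the windows of failure are visible, and with bump heights simply sampled from the target function):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step(x, s, w=50.0):
        # Modest weight, so each step has a noticeable window of failure.
        return sigmoid(w * (x - s))

    def bump_approximation(target, x, n_bumps=10, shift=0.0):
        # Sum of bumps whose heights sample `target`, with bump edges shifted
        # by `shift` (a fraction of one bump width).
        width = 1.0 / n_bumps
        total = 0.0
        for j in range(-1, n_bumps + 1):
            left = (j + shift) * width
            centre = min(max(left + width / 2, 0.0), 1.0)
            total += target(centre) * (step(x, left) - step(x, left + width))
        return total

    def smoothed(target, x, M=2):
        # Add M shifted approximations, each to target/M, so the failure windows
        # of one copy fall away from the failure windows of the others.
        return sum(bump_approximation(lambda u: target(u) / M, x, shift=k / M)
                   for k in range(M))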

Conclusion

The explanation for universality we've discussed is certainly not a practical prescription for how to compute using neural networks! In this, it's much like proofs of universality for NAND gates and the like. For this reason, I've focused mostly on trying to make the construction clear and easy to follow, and not on optimizing the details of the construction. However, you may find it a fun and instructive exercise to see if you can improve the construction.

Although the result isn't directly useful in constructing networks, it's important because it takes off the table the question of whether any particular function is computable using a neural network. The answer to that question is always "yes". So the right question to ask is not whether any particular function is computable, but rather what's a good way to compute the function.
The universality construction we've developed uses just two hidden layers to compute an arbitrary function. Furthermore, as we've discussed, it's possible to get the same result with just a single hidden layer. Given this, you might wonder why we would ever be interested in deep networks, i.e., networks with many hidden layers. Can't we simply replace those networks with shallow, single hidden layer networks?
While in principle that's possible, there are good practical reasons to use deep networks. As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems. Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes. In later chapters, we'll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge.

Chapter acknowledgments: Thanks to Jen Dodd and Chris Olah for many discussions about universality in neural networks. My thanks, in particular, to Chris for suggesting the use of a lookup table to prove universality. The interactive visual form of the chapter is inspired by the work of people such as Mike Bostock, Amit Patel, Bret Victor, and Steven Wittens.
To sum up: universality tells us that neural networks can compute any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.
In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.

Last update: Fri Jan 22 14:09:50 2016

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.