
CHAPTER 4

A visual proof that neural nets can compute any function

By Michael Nielsen / Jan 2016

One of the most striking facts about neural networks is that they can compute any function at all. That is, suppose someone hands you some complicated, wiggly function, f(x):


No matter what the function, there is guaranteed to be a neural network so that for every possible input, x, the value f(x) (or some close approximation) is output from the network, e.g.:


This result holds even if the function has many inputs, f = f(x1, …, xm), and many outputs. For instance, here's a network computing a function with m = 3 inputs and n = 2 outputs:




This result tells us that neural networks have a kind of universality. No matter what function we want to compute, we know that there is a neural network which can do the job.

What's more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons - a so-called single hidden layer. So even very simple network architectures can be extremely powerful.
The universality theorem is well known by people who use neural networks. But why it's true is not so widely understood. Most of the explanations available are quite technical. For instance, one of the original papers proving the result* did so using the Hahn-Banach theorem, the Riesz Representation theorem, and some Fourier analysis. If you're a mathematician the argument is not difficult to follow, but it's not so easy for most people. That's a pity, since the underlying reasons for universality are simple and beautiful.

*Approximation by superpositions of a sigmoidal function, by George Cybenko (1989). The result was very much in the air at the time, and several groups proved closely related results. Cybenko's paper contains a useful discussion of much of that work. Another important early paper is Multilayer feedforward networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989). This paper uses the Stone-Weierstrass theorem to arrive at similar results.

In this chapter I give a simple and mostly visual explanation of the universality theorem. We'll go step by step through the underlying ideas. You'll understand why it's true that neural networks can compute any function. You'll understand some of the limitations of the result. And you'll understand how the result relates to deep neural networks.
To follow the material in the chapter, you do not need to have read earlier chapters in this book. Instead, the chapter is structured to be enjoyable as a self-contained essay. Provided you have just a little basic familiarity with neural networks, you should be able to follow the explanation. I will, however, provide occasional links to earlier material, to help fill in any gaps in your knowledge.
Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it's worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Again, that can be thought of as computing a function*. Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation*. Universality means that, in principle, neural networks can do all these things and many more.

*Actually, computing one of many functions, since there are often many acceptable translations of a given piece of text.

*Ditto the remark about translation and there being many possible functions.
Of course, just because we know a neural network exists that can (say) translate Chinese text into English, that doesn't mean we have good techniques for constructing or even recognizing such a network. This limitation applies also to traditional universality theorems for models such as Boolean circuits. But, as we've seen earlier in the book, neural networks have powerful algorithms for learning functions. That combination of learning algorithms + universality is an attractive mix. Up to now, the book has focused on the learning algorithms. In this chapter, we focus on universality, and what it means.

Two caveats

Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement "a neural network can compute any function".
First, this doesn't mean that a network can be used to exactly compute any function. Rather, we can get an approximation that is as good as we want. By increasing the number of hidden neurons we can improve the approximation. For instance, earlier I illustrated a network computing some function f(x) using three hidden neurons. For most functions only a low-quality approximation will be possible using three hidden neurons. By increasing the number of hidden neurons (say, to five) we can typically get a better approximation:

And we can do still better by further increasing the number of hidden neurons.
To make this statement more precise, suppose we're given a function f(x) which we'd like to compute to within some desired accuracy ε > 0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies |g(x) - f(x)| < ε, for all inputs x. In other words, the approximation will be good to within the desired accuracy for every possible input.
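In code, you can think of the guarantee as a simple check - a sketch of my own, assuming the goal function f and the network's output function g are both available as Python functions on the interval [0, 1]:

    import numpy as np

    def within_accuracy(f, g, eps, xs=np.linspace(0.0, 1.0, 1001)):
        # True if the network output g stays within eps of the goal f on a fine grid.
        return np.max(np.abs(g(xs) - f(xs))) < eps

The universality theorem says that for any continuous f and any eps, there is some network whose output g passes this check.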
The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we'd really like to compute is discontinuous, it's often the case that a continuous approximation is good enough. If that's so, then we can use a neural network. In practice, this is not usually an important limitation.
Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision. In this chapter we'll actually prove a slightly weaker version of this result, using two hidden layers instead of one. In the problems I'll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.

Universality with one input and one output

To understand why the universality theorem is true, let's start by understanding how to construct a neural network which approximates a function with just one input and one output:

It turns out that this is the core of the problem of universality. Once we've understood this special case it's actually pretty easy to extend to functions with many inputs and many outputs.

To build insight into how to construct a network to compute f, let's start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron:

To get a feel for how components in the network work, let's focus on the top hidden neuron. In the diagram below, click on the weight, w, and drag the mouse a little ways to the right to increase w. You can immediately see how the function computed by the top hidden neuron changes:

As we learnt earlier in the book, what's being computed by the hidden neuron is σ(wx + b), where σ(z) ≡ 1/(1 + e^(-z)) is the sigmoid function. Up to now, we've made frequent use of this algebraic form. But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph. This won't just give us a better feel for what's going on, it will also give us a proof* of universality that applies to activation functions other than the sigmoid function.

*Strictly speaking, the visual approach I'm taking isn't what's traditionally thought of as a proof. But I believe the visual approach gives more insight into why the result is true than a traditional proof. And, of course, that kind of insight is the real purpose behind a proof. Occasionally, there will be small gaps in the reasoning I present: places where I make a visual argument that is plausible, but not quite rigorous. If this bothers you, then consider it a challenge to fill in the missing steps. But don't lose sight of the real purpose: to understand why the universality theorem is true.

To get started on this proof, try clicking on the bias, b, in the diagram above, and dragging to the right to increase it. You'll see that as the bias increases the graph moves to the left, but its shape doesn't change.

Next, click and drag to the left in order to decrease the bias. You'll see that as the bias decreases the graph moves to the right, but, again, its shape doesn't change.

Next, decrease the weight to around 2 or 3. You'll see that as you decrease the weight, the curve broadens out. You might need to change the bias as well, in order to keep the curve in frame.

Finally, increase the weight up past w = 100. As you do, the curve gets steeper, until eventually it begins to look like a step function. Try to adjust the bias so the step occurs near x = 0.3. The following short clip shows what your result should look like. Click on the play button to play (or replay) the video:


We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation. Below I've plotted the output from the top hidden neuron when the weight is w = 999. Note that this plot is static, and you can't change parameters such as the weight.

It's actually quite a bit easier to work with step functions than general sigmoid functions. The reason is that in the output layer we add up contributions from all the hidden neurons. It's easy to analyze the sum of a bunch of step functions, but rather more difficult to reason about what happens when you add up a bunch of sigmoid-shaped curves. And so it makes things much easier to assume that our hidden neurons are outputting step functions. More concretely, we do this by fixing the weight w to be some very large value, and then setting the position of the step by modifying the bias. Of course, treating the output as a step function is an approximation, but it's a very good approximation, and for now we'll treat it as exact. I'll come back later to discuss the impact of deviations from this approximation.
At what value of x does the step occur? Put another way, how does the position of the step depend upon the weight and bias?

To answer this question, try modifying the weight and bias in the diagram above (you may need to scroll back a bit). Can you figure out how the position of the step depends on w and b? With a little work you should be able to convince yourself that the position of the step is proportional to b, and inversely proportional to w.

In fact, the step is at position s = -b/w, as you can see by modifying the weight and bias in the following diagram:

It will greatly simplify our lives to describe hidden neurons using just a single parameter, s, which is the step position, s = -b/w. Try modifying s in the following diagram, in order to get used to the new parameterization:

As noted above, we've implicitly set the weight w on the input to be some large value - big enough that the step function is a very good approximation. We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias b = -ws.
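As a concrete sketch of this parameterization (my own code, not part of the book's interactive diagrams), here's a step-position neuron with a large fixed weight:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step_neuron(x, s, w=1000.0):
        # Hidden neuron described by its step position s: with a large weight w
        # and bias b = -w*s, the output sigmoid(w*x + b) is close to 0 for x < s
        # and close to 1 for x > s.
        b = -w * s
        return sigmoid(w * x + b)

    print(step_neuron(0.39, 0.40))  # just below the step: close to 0
    print(step_neuron(0.41, 0.40))  # just above the step: close to 1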

Up to now we've been focusing on the output from just the top hidden neuron. Let's take a look at the behavior of the entire network. In particular, we'll suppose the hidden neurons are computing step functions parameterized by step points s1 (top neuron) and s2 (bottom neuron). And they'll have respective output weights w1 and w2. Here's the network:

What's being plotted on the right is the weighted output w1 a1 + w2 a2 from the hidden layer. Here, a1 and a2 are the outputs from the top and bottom hidden neurons, respectively*. These outputs are denoted with a's because they're often known as the neurons' activations.

*Note, by the way, that the output from the whole network is σ(w1 a1 + w2 a2 + b), where b is the bias on the output neuron. Obviously, this isn't the same as the weighted output from the hidden layer, which is what we're plotting here. We're going to focus on the weighted output from the hidden layer right now, and only later will we think about how that relates to the output from the whole network.

Try increasing and decreasing the step point s1 of the top hidden neuron. Get a feel for how this changes the weighted output from the hidden layer. It's particularly worth understanding what happens when s1 goes past s2. You'll see that the graph changes shape when this happens, since we have moved from a situation where the top hidden neuron is the first to be activated to a situation where the bottom hidden neuron is the first to be activated.

Similarly, try manipulating the step point s2 of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.

Try increasing and decreasing each of the output weights. Notice how this rescales the contribution from the respective hidden neurons. What happens when one of the weights is zero?

Finally, try setting w1 to be 0.8 and w2 to be -0.8. You get a "bump" function, which starts at point s1, ends at point s2, and has height 0.8. For instance, the weighted output might look like this:


Of course, we can rescale the bump to have any height at all. Let's use a single parameter, h, to denote the height. To reduce clutter I'll also remove the "s" and "w" notations.

Try changing the value of h up and down, to see how the height of the bump changes. Try changing the height so it's negative, and observe what happens. And try changing the step points to see how that changes the shape of the bump.
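Here's a small sketch of the bump construction in code (my own illustration, with arbitrary step points 0.4 and 0.6), built from a pair of step neurons with output weights +h and -h:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step_neuron(x, s, w=1000.0):
        # Approximate step function at position s (bias b = -w*s).
        return sigmoid(w * (x - s))

    def bump(x, s1, s2, h):
        # Weighted output of a pair of hidden neurons: +h times the step at s1,
        # -h times the step at s2.  Roughly h for s1 < x < s2, roughly 0 elsewhere.
        return h * step_neuron(x, s1) - h * step_neuron(x, s2)

    print(round(bump(0.5, 0.4, 0.6, 0.8), 3))  # inside the bump: about 0.8
    print(round(bump(0.9, 0.4, 0.6, 0.8), 3))  # outside the bump: about 0.0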
You'll notice, by the way, that we're using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of if-then-else statement, e.g.:

    if input >= step point:
        add 1 to the weighted output
    else:
        add 0 to the weighted output

For the most part I'm going to stick with the graphical point of view. But in what follows you may sometimes find it helpful to switch points of view, and think about things in terms of if-then-else.
We can use our bump-making trick to get two bumps, by gluing two pairs of hidden neurons together into the same network:

I've suppressed the weights here, simply writing the h values for each pair of hidden neurons. Try increasing and decreasing both h values, and observe how it changes the graph. Move the bumps around by changing the step points.
More generally, we can use this idea to get as many peaks as we want, of any height. In particular, we can divide the interval [0, 1] up into a large number, N, of subintervals, and use N pairs of hidden neurons to set up peaks of any desired height. Let's see how this works for N = 5. That's quite a few neurons, so I'm going to pack things in a bit. Apologies for the complexity of the diagram: I could hide the complexity by abstracting away further, but I think it's worth putting up with a little complexity, for the sake of getting a more concrete feel for how these networks work.

You can see that there are five pairs of hidden neurons. The step points for the respective pairs of neurons are 0, 1/5, then 1/5, 2/5, and so on, out to 4/5, 5/5. These values are fixed - they make it so we get five evenly spaced bumps on the graph.
Each pair of neurons has a value of h associated to it. Remember, the connections output from the neurons have weights h and -h (not marked). Click on one of the h values, and drag the mouse to the right or left to change the value. As you do so, watch the function change. By changing the output weights we're actually designing the function!

Contrariwise, try clicking on the graph, and dragging up or down to change the height of any of the bump functions. As you change the heights, you can see the corresponding change in h values. And, although it's not shown, there is also a change in the corresponding output weights, which are +h and -h.
In other words, we can directly manipulate the function appearing in the graph on the right, and see that reflected in the h values on the left. A fun thing to do is to hold the mouse button down and drag the mouse from one side of the graph to the other. As you do this you draw out a function, and get to watch the parameters in the neural network adapt.

Time for a challenge.
Let's think back to the function I plotted at the beginning of the chapter:

I didn't say it at the time, but what I plotted is actually the function

f(x) = 0.2 + 0.4x^2 + 0.3x sin(15x) + 0.05 cos(50x),   (113)

plotted over x from 0 to 1, and with the y axis taking values from 0 to 1.
That's obviously not a trivial function.

You're going to figure out how to compute it using a neural network.
In our networks above we've been analyzing the weighted combination ∑j wj aj output from the hidden neurons. We now know how to get a lot of control over this quantity. But, as I noted earlier, this quantity is not what's output from the network. What's output from the network is σ(∑j wj aj + b) where b is the bias on the output neuron. Is there some way we can achieve control over the actual output from the network?

The solution is to design a neural network whose hidden layer has a weighted output given by σ⁻¹ ∘ f(x), where σ⁻¹ is just the inverse of the σ function. That is, we want the weighted output from the hidden layer to be:


If we can do this, then the output from the network as a whole will be a good approximation to f(x)*.

*Note that I have set the bias on the output neuron to 0.
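In code, the inverse of the sigmoid is the logit function, so the target for the weighted hidden output might be computed like this (a sketch of my own; f here is the goal function assumed above, which takes values strictly between 0 and 1):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_inverse(y):
        # The inverse of the sigmoid: sigma^-1(y) = ln(y / (1 - y)), for 0 < y < 1.
        return np.log(y / (1.0 - y))

    def f(x):
        # The goal function plotted earlier in the chapter.
        return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

    # If the weighted output from the hidden layer approximates sigmoid_inverse(f(x)),
    # then applying the sigmoid at the output neuron (with bias 0) recovers f(x).
    x = 0.3
    print(sigmoid(sigmoid_inverse(f(x))), f(x))  # the two values should agree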

Your challenge, then, is to design a neural network to approximate the goal function shown just above. To learn as much as possible, I want you to solve the problem twice. The first time, please click on the graph, directly adjusting the heights of the different bump functions. You should find it fairly easy to get a good match to the goal function. How well you're doing is measured by the average deviation between the goal function and the function the network is actually computing. Your challenge is to drive the average deviation as low as possible. You complete the challenge when you drive the average deviation to 0.40 or below.

Once you've done that, click on "Reset" to randomly re-initialize the bumps. The second time you solve the problem, resist the urge to click on the graph. Instead, modify the h values on the left-hand side, and again attempt to drive the average deviation to 0.40 or below.


You've now figured out all the elements necessary for the network to approximately compute the function f(x)! It's only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing more bumps.

In particular, it's easy to convert all the data we have found back into the standard parameterization used for neural networks. Let me just recap quickly how that works.

The first layer of weights all have some large, constant value, say w = 1000.

The biases on the hidden neurons are just b = -ws. So, for instance, for the second hidden neuron s = 0.2 becomes b = -1000 × 0.2 = -200.

The final layer of weights are determined by the h values. So, for instance, the value you've chosen above for the first h, h = 1.3, means that the output weights from the top two hidden neurons are 1.3 and -1.3, respectively. And so on, for the entire layer of output weights.

Finally, the bias on the output neuron is 0.
That's everything: we now have a complete description of a neural network which does a pretty good job computing our original goal function. And we understand how to improve the quality of the approximation by increasing the number of hidden neurons.
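As an illustration, here's that recipe as a sketch in code. The five bump heights below are placeholders of my own choosing, standing in for whatever h values you found above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    N = 5                                               # five bumps on [0, 1]
    step_points = [j / N for j in range(N + 1)]         # 0, 1/5, 2/5, ..., 1
    heights = [1.3, -0.6, 0.2, 0.9, -1.1]               # one h per pair of hidden neurons

    w = 1000.0                                          # large first-layer weight
    hidden_biases = [-w * s for left, right in zip(step_points[:-1], step_points[1:])
                     for s in (left, right)]
    output_weights = [sign * h for h in heights for sign in (+1.0, -1.0)]
    output_bias = 0.0

    def network(x):
        activations = [sigmoid(w * x + b) for b in hidden_biases]
        weighted = sum(wj * a for wj, a in zip(output_weights, activations))
        return sigmoid(weighted + output_bias)

    print(round(network(0.1), 3))   # inside the first bump: close to sigmoid(1.3)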
What's more, there was nothing special about our original goal function, f(x) = 0.2 + 0.4x^2 + 0.3x sin(15x) + 0.05 cos(50x). We could have used this procedure for any continuous function from [0, 1] to [0, 1]. In essence, we're using our single-layer neural networks to build a lookup table for the function. And we'll be able to build on this idea to provide a general proof of universality.

Many input variables

Let's extend our results to the case of many input variables. This sounds complicated, but all the ideas we need can be understood in the case of just two inputs. So let's address the two-input case.

We'll start by considering what happens when we have two inputs to a neuron:

Here, we have inputs x and y, with corresponding weights w1 and w2, and a bias b on the neuron. Let's set the weight w2 to 0, and then play around with the first weight, w1, and the bias, b, to see how they affect the output from the neuron:
Output

As you can see, with w2 = 0 the input y makes no difference to the output from the neuron. It's as though x is the only input.

Given this, what do you think happens when we increase the weight w1 to w1 = 100, with w2 remaining 0? If you don't immediately see the answer, ponder the question for a bit, and see if you can figure out what happens. Then try it out and see if you're right. I've shown what happens in the following movie:

Just as in our earlier discussion, as the input weight gets larger the output approaches a step function. The difference is that now the step function is in three dimensions. Also as before, we can move the location of the step point around by modifying the bias. The actual location of the step point is sx ≡ -b/w1.
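A quick sketch of that two-input neuron in code (my own example, with w1 = 1000 and b = -400, so the step sits at sx = 0.4):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def two_input_neuron(x, y, w1=1000.0, w2=0.0, b=-400.0):
        # With w2 = 0 and w1 large, the output is a step in the x direction,
        # located at sx = -b/w1 (here 400/1000 = 0.4); y has no effect.
        return sigmoid(w1 * x + w2 * y + b)

    print(round(two_input_neuron(0.39, 0.9), 3))  # x below the step: close to 0
    print(round(two_input_neuron(0.41, 0.1), 3))  # x above the step: close to 1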

Let's redo the above using the position of the step as the parameter:

Output

Here, we assume the weight on the x input has some large value - I've used w1 = 1000 - and the weight w2 = 0. The number on the neuron is the step point, and the little x above the number reminds us that the step is in the x direction. Of course, it's also possible to get a step function in the y direction, by making the weight on the y input very large (say, w2 = 1000), and the weight on the x equal to 0, i.e., w1 = 0:
Output

The number on the neuron is again the step point, and in this case the little y above the number reminds us that the step is in the y direction. I could have explicitly marked the weights on the x and y inputs, but decided not to, since it would make the diagram rather cluttered. But do keep in mind that the little y marker implicitly tells us that the y weight is large, and the x weight is 0.

We can use the step functions we've just constructed to compute a three-dimensional bump function. To do this, we use two neurons, each computing a step function in the x direction. Then we combine those step functions with weight h and -h, respectively, where h is the desired height of the bump. It's all illustrated in the following diagram:
Weighted output from hidden layer

Try changing the value of the height, h. Observe how it relates to the weights in the network. And see how it changes the height of the bump function on the right.

Also, try changing the step point 0.30 associated to the top hidden neuron. Witness how it changes the shape of the bump. What happens when you move it past the step point 0.70 associated to the bottom hidden neuron?
We've figured out how to make a bump function in the x direction. Of course, we can easily make a bump function in the y direction, by using two step functions in the y direction. Recall that we do this by making the weight large on the y input, and the weight 0 on the x input. Here's the result:

Weighted output from hidden layer

This looks nearly identical to the earlier network! The only thing explicitly shown as changing is that there are now little y markers on our hidden neurons. That reminds us that they're producing y step functions, not x step functions, and so the weight is very large on the y input, and zero on the x input, not vice versa. As before, I decided not to show this explicitly, in order to avoid clutter.

Let's consider what happens when we add up two bump functions, one in the x direction, the other in the y direction, both of height h:
Weighted output from hidden layer

To simplify the diagram I've dropped the connections with zero weight. For now, I've left in the little x and y markers on the hidden neurons, to remind you in what directions the bump functions are being computed. We'll drop even those markers later, since they're implied by the input variable.

Try varying the parameter h. As you can see, this causes the output weights to change, and also the heights of both the x and y bump functions.
What we've built looks a little like a tower function:

Tower function

If we could build such tower functions, then we could use them to approximate arbitrary functions, just by adding up many towers of different heights, and in different locations:

Many towers

Of course, we haven't yet figured out how to build a tower function. What we have constructed looks like a central tower, of height 2h, with a surrounding plateau, of height h.
But we can make a tower function. Remember that earlier we saw neurons can be used to implement a type of if-then-else statement:

    if input >= threshold:
        output 1
    else:
        output 0

That was for a neuron with just a single input. What we want is to apply a similar idea to the combined output from the hidden neurons:

    if combined output from hidden neurons >= threshold:
        output 1
    else:
        output 0

If we choose the threshold appropriately - say, a value of 3h/2, which is sandwiched between the height of the plateau and the height of the central tower - we could squash the plateau down to zero, and leave just the tower standing.

Can you see how to do this? Try experimenting with the following network to figure it out. Note that we're now plotting the output from the entire network, not just the weighted output from the hidden layer. This means we add a bias term to the weighted output from the hidden layer, and apply the σ function. Can you find values for h and b which produce a tower? This is a bit tricky, so if you think about this for a while and remain stuck, here's two hints: (1) To get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all h or -h) to be large; and (2) the value of b determines the scale of the if-then-else threshold.
Output

With our initial parameters, the output looks like a flattened version of the earlier diagram, with its tower and plateau. To get the desired behaviour, we increase the parameter h until it becomes large. That gives the if-then-else thresholding behaviour. Second, to get the threshold right, we'll choose b ≈ -3h/2. Try it, and see how it works!
Here's what it looks like, when we use h = 10:

Even for this relatively modest value of h, we get a pretty good tower function. And, of course, we can make it as good as we want by increasing h still further, and keeping the bias as b = -3h/2.
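Here's a small sketch of that tower construction in code (my own illustration, with step points 0.4 and 0.6 chosen arbitrarily):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step(u, s, big_w=1000.0):
        # Approximate step function in one variable, rising at u = s.
        return sigmoid(big_w * (u - s))

    def tower(x, y, h=10.0):
        # An x-direction bump plus a y-direction bump, each of height h,
        # thresholded by the output neuron with bias b = -3h/2.
        weighted = h * (step(x, 0.4) - step(x, 0.6)) + h * (step(y, 0.4) - step(y, 0.6))
        return sigmoid(weighted - 1.5 * h)

    print(round(tower(0.5, 0.5), 3))  # inside the tower: close to 1
    print(round(tower(0.5, 0.9), 3))  # on the plateau arm: close to 0
    print(round(tower(0.9, 0.9), 3))  # outside: close to 0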
Let's try gluing two such networks together, in order to compute two different tower functions. To make the respective roles of the two sub-networks clear I've put them in separate boxes, below: each box computes a tower function, using the technique described above. The graph on the right shows the weighted output from the second hidden layer, that is, it's a weighted combination of tower functions.

Weighted output

In particular, you can see that by modifying the weights in the final layer you can change the height of the output towers.

The same idea can be used to compute as many towers as we like. We can also make them as thin as we like, and whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function of two variables:
Many towers

In particular, by making the weighted output from the second hidden layer a good approximation to σ⁻¹ ∘ f, we ensure the output from our network will be a good approximation to any desired function, f.
What about functions of more than two variables?

Let's try three variables x1, x2, x3. The following network can be used to compute a tower function in four dimensions:

Here, the x1, x2, x3 denote inputs to the network. The s1, t1 and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points s1, t1, s2, … The weights in the second layer alternate +h, -h, where h is some very large number. And the output bias is -5h/2.
This network computes a function which is 1 provided three conditions are met: x1 is between s1 and t1; x2 is between s2 and t2; and x3 is between s3 and t3. The network is 0 everywhere else. That is, it's a kind of tower which is 1 in a little region of input space, and 0 everywhere else.
By gluing together many such networks we can get as many towers as we want, and so approximate an arbitrary function of three variables. Exactly the same idea works in m dimensions. The only change needed is to make the output bias (-m + 1/2)h, in order to get the right kind of sandwiching behavior to level the plateau.
Okay, so we now know how to use neural networks to approximate a real-valued function of many variables. What about vector-valued functions f(x1, …, xm) ∈ R^n? Of course, such a function can be regarded as just n separate real-valued functions, f1(x1, …, xm), f2(x1, …, xm), and so on. So we create a network approximating f1, another network for f2, and so on. And then we simply glue all the networks together. So that's also easy to cope with.

Problem

We've seen how to use networks with two hidden layers to approximate an arbitrary function. Can you find a proof showing that it's possible with just a single hidden layer? As a hint, try working in the case of just two input variables, and showing that: (a) it's possible to get step functions not just in the x or y directions, but in an arbitrary direction; (b) by adding up many of the constructions from part (a) it's possible to approximate a tower function which is circular in shape, rather than rectangular; (c) using these circular towers, it's possible to approximate an arbitrary function. To do part (c) it may help to use ideas from a bit later in this chapter.

Extension beyond sigmoid neurons

We've proved that networks made up of sigmoid neurons can compute any function. Recall that in a sigmoid neuron the inputs x1, x2, … result in the output σ(∑j wj xj + b), where wj are the weights, b is the bias, and σ is the sigmoid function:

What if we consider a different type of neuron, one using some other activation function, s(z):

That is, we'll assume that if our neuron has inputs x1, x2, …, weights w1, w2, … and bias b, then the output is s(∑j wj xj + b).

We can use this activation function to get a step function, just as we did with the sigmoid. Try ramping up the weight in the following, say to w = 100:

Just as with the sigmoid, this causes the activation function to contract, and ultimately it becomes a very good approximation to a step function. Try changing the bias, and you'll see that we can set the position of the step to be wherever we choose. And so we can use all the same tricks as before to compute any desired function.
What properties does s(z) need to satisfy in order for this to work? We do need to assume that s(z) is well-defined as z → -∞ and z → ∞. These two limits are the two values taken on by our step function. We also need to assume that these limits are different from one another. If they weren't, there'd be no step, simply a flat graph! But provided the activation function s(z) satisfies these properties, neurons based on such an activation function are universal for computation.
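As a quick illustration (my own, using tanh as the alternative activation s(z), whose limits as z → -∞ and z → ∞ are -1 and +1 and so are different):

    import numpy as np

    def step_from_activation(x, s, activation=np.tanh, w=1000.0):
        # With any activation whose limits at -infinity and +infinity exist and differ,
        # a large weight w and bias b = -w*s give an approximate step at x = s.
        return activation(w * x - w * s)

    print(round(step_from_activation(0.29, 0.3), 3))  # below the step: close to -1
    print(round(step_from_activation(0.31, 0.3), 3))  # above the step: close to +1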

Problems

Earlier in the book we met another type of neuron known as a rectified linear unit. Explain why such neurons don't satisfy the conditions just given for universality. Find a proof of universality showing that rectified linear units are universal for computation.

Suppose we consider linear neurons, i.e., neurons with the activation function s(z) = z. Explain why linear neurons don't satisfy the conditions just given for universality. Show that such neurons can't be used to do universal computation.

Fixing up the step functions

Up to now, we've been assuming that our neurons can produce step functions exactly. That's a pretty good approximation, but it is only an approximation. In fact, there will be a narrow window of failure, illustrated in the following graph, in which the function behaves very differently from a step function:

In these windows of failure the explanation I've given for universality will fail.

Now, it's not a terrible failure. By making the weights input to the neurons big enough we can make these windows of failure as small as we like. Certainly, we can make the window much narrower than I've shown above - narrower, indeed, than our eye could see. So perhaps we might not worry too much about this problem. Nonetheless, it'd be nice to have some way of addressing the problem.

In fact, the problem turns out to be easy to fix. Let's look at the fix for neural networks computing functions with just one input and one output. The same ideas work also to address the problem when there are more inputs and outputs.
In particular, suppose we want our network to compute some function, f. As before, we do this by trying to design our network so that the weighted output from our hidden layer of neurons is σ⁻¹ ∘ f(x).

If we were to do this using the technique described earlier, we'd use the hidden neurons to produce a sequence of bump functions:

Again, I've exaggerated the size of the windows of failure, in order to make them easier to see. It should be pretty clear that if we add all these bump functions up we'll end up with a reasonable approximation to σ⁻¹ ∘ f(x), except within the windows of failure.

Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to σ⁻¹ ∘ f(x)/2. Of course, this looks just like a scaled-down version of the last graph:

And suppose we use another set of hidden neurons to compute an approximation to σ⁻¹ ∘ f(x)/2, but with the bases of the bumps shifted by half the width of a bump:
Now we have two different approximations to σ⁻¹ ∘ f(x)/2. If we add up the two approximations we'll get an overall approximation to σ⁻¹ ∘ f(x). That overall approximation will still have failures in small windows. But the problem will be much less than before. The reason is that points in a failure window for one approximation won't be in a failure window for the other. And so the approximation will be a factor of roughly 2 better in those windows.
We could do even better by adding up a large number, M, of overlapping approximations to the function σ⁻¹ ∘ f(x)/M. Provided the windows of failure are narrow enough, a point will only ever be in one window of failure. And provided we're using a large enough number M of overlapping approximations, the result will be an excellent overall approximation.
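Here's a rough sketch of the idea in code (my own, with a deliberately modest weight so the windows of failure are visible, and with bump heights simply sampled from the target function):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step(x, s, w=50.0):
        # Modest weight, so each step has a noticeable window of failure.
        return sigmoid(w * (x - s))

    def bump_approximation(target, x, n_bumps=10, shift=0.0):
        # Sum of bumps whose heights sample `target`, with bump edges shifted
        # by `shift` (a fraction of one bump width).
        width = 1.0 / n_bumps
        total = 0.0
        for j in range(-1, n_bumps + 1):
            left = (j + shift) * width
            centre = min(max(left + width / 2, 0.0), 1.0)
            total += target(centre) * (step(x, left) - step(x, left + width))
        return total

    def smoothed(target, x, M=2):
        # Add M shifted approximations, each to target/M, so the failure windows
        # of one copy fall away from the failure windows of the others.
        return sum(bump_approximation(lambda u: target(u) / M, x, shift=k / M)
                   for k in range(M))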

Conclusion

The explanation for universality we've discussed is certainly not a practical prescription for how to compute using neural networks! In this, it's much like proofs of universality for NAND gates and the like. For this reason, I've focused mostly on trying to make the construction clear and easy to follow, and not on optimizing the details of the construction. However, you may find it a fun and instructive exercise to see if you can improve the construction.

Although the result isn't directly useful in constructing networks, it's important because it takes off the table the question of whether any particular function is computable using a neural network. The answer to that question is always "yes". So the right question to ask is not whether any particular function is computable, but rather what's a good way to compute the function.
The universality construction we've developed uses just two hidden layers to compute an arbitrary function. Furthermore, as we've discussed, it's possible to get the same result with just a single hidden layer. Given this, you might wonder why we would ever be interested in deep networks, i.e., networks with many hidden layers. Can't we simply replace those networks with shallow, single hidden layer networks?
While in principle that's possible, there are good practical reasons to use deep networks. As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems. Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes. In later chapters, we'll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge.

Chapter acknowledgments: Thanks to Jen Dodd and Chris Olah for many discussions about universality in neural networks. My thanks, in particular, to Chris for suggesting the use of a lookup table to prove universality. The interactive visual form of the chapter is inspired by the work of people such as Mike Bostock, Amit Patel, Bret Victor, and Steven Wittens.
To sum up: universality tells us that neural networks can compute any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.
In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.

Last update: Fri Jan 22 14:09:50 2016

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.