Learning is generally defined as adapting the weights of the neural network in order to
implement a specific mapping, based on a training set. In this section we will unfold the
statistical theory of learning and give a precise meaning to the somewhat naive
definition of learning given above. At first, we view learning as an approximation task,
then we pursue a statistical investigation into the problem.
Learning as an approximation
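Given a function $F(x) \in L^2$ and a network architecture

$$Net(x, w) = \varphi\left( \sum_{i} w_i^{(L)} \, \varphi\left( \sum_{j} w_{ij}^{(L-1)} \cdots \varphi\left( \sum_{n=1}^{N} w_{mn}^{(1)} x_n \right) \right) \right),$$

where $\varphi(\cdot)$ denotes the activation function of the neurons, find

$$w_{opt} : \min_{w} \left\| F(x) - Net(x, w) \right\|_{L^2}^2 .$$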
This problem is analogous to the problem of finding the optimal Fourier coefficients for
a function $F(x)$. There are two approaches to solving this problem, depending on the
construction of the network.
Learning by a one-layer network
In this case one can draw on the results of Fourier analysis, namely any
Learning from examples:
Instead of the function $F(x)$ to be approximated, we are given a training set of size $K$,

$$\tau^{(K)} = \{ (x_k, d_k), \; k = 1, \dots, K \},$$

where $x_k \in X$ and $d_k = F(x_k)$ is the so-called desired output. One can then define an
empirical error over the training set, e.g.

$$\frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(x_k, w) \right)^2 .$$
The weight optimization, or learning, then reduces to finding

$$w_{opt}^{(K)} : \min_{w} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(x_k, w) \right)^2 .$$
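To make the empirical error concrete, the following minimal Python sketch evaluates it for a toy network. The network layout (a scalar input feeding a sum of tanh units), the target $F(x) = \tanh(2x)$ and the names `net` and `empirical_error` are assumptions chosen for illustration only; they do not come from the text.

```python
import math

def net(x, w):
    """Scalar one-hidden-layer network: sum_i v_i * tanh(a_i * x + b_i).

    `w` is a list of (v, a, b) triples; this layout is an illustrative choice.
    """
    return sum(v * math.tanh(a * x + b) for v, a, b in w)

def empirical_error(samples, w):
    """Return (1/K) * sum_k (d_k - Net(x_k, w))**2 over the training set."""
    return sum((d - net(x, w)) ** 2 for x, d in samples) / len(samples)

# Hypothetical target F(x) = tanh(2x), sampled at K = 21 points.
xs = [k / 10.0 for k in range(-10, 11)]
samples = [(x, math.tanh(2.0 * x)) for x in xs]

w_exact = [(1.0, 2.0, 0.0)]   # weights that reproduce F exactly
w_other = [(1.0, 1.0, 0.0)]   # some other weight vector

print(empirical_error(samples, w_exact))
print(empirical_error(samples, w_other))
```

Learning in the sense above means searching over `w` for the smallest value of `empirical_error`; here the first weight vector attains zero error because the toy target lies inside the model family.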
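If the sequence $x_k \in X$ is drawn randomly from $X$ subject to a uniform distribution, then

$$\frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(x_k, w) \right)^2 \approx E\left( d - Net(x, w) \right)^2 = \int \left( d - Net(x, w) \right)^2 p(x)\, dx .$$

The function

$$J(w) = \frac{1}{K} \sum_{k=1}^{K} l\left( d_k, Net(x_k, w) \right),$$

where $l$ denotes the error function (the squared difference in the example above), is called the empirical error function.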
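The approximation of the expected error by the empirical one can be checked numerically. The sketch below assumes a toy setting that is not part of the text: $X = [0, 1]$, target $F(x) = x^2$ and a one-parameter model $Net(x, w) = w x$, for which the expected error has the closed form $1/5 - w/2 + w^2/3$.

```python
import random

def target(x):
    return x * x                 # hypothetical F(x)

def net(x, w):
    return w * x                 # hypothetical one-parameter Net(x, w)

def empirical_error(w, K, rng):
    """(1/K) * sum of squared errors over K points drawn uniformly from [0, 1)."""
    total = 0.0
    for _ in range(K):
        x = rng.random()
        total += (target(x) - net(x, w)) ** 2
    return total / K

def expected_error(w):
    """Integral over [0, 1] of (x^2 - w x)^2 dx = 1/5 - w/2 + w^2/3."""
    return 1.0 / 5.0 - w / 2.0 + w * w / 3.0

rng = random.Random(0)
w = 0.75
for K in (10, 1000, 100000):
    print(K, empirical_error(w, K, rng), expected_error(w))
```

As $K$ grows, the printed empirical value should settle near the closed-form expectation, which is the behaviour the approximation sign in the formula above expresses.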
There are two fundamental questions which can arise with regard to $w_{opt}^{(K)}$:
• What is the connection between $\min_{w \in W} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(x_k, w) \right)^2$ and $\min_{w} \int \left( d - Net(x, w) \right)^2 p(x)\, dx$?
• Once this relationship is specified, how can $J(w)$ be minimized in order to find $w_{opt}^{(K)}$?
As one can see, the first question is a statistical one and the second is purely an
optimization task. Thus, we treat learning in two subsections:
• statistical learning theory
• learning as an optimization problem
Statistical learning theory
Let us assume that the samples $x_k$ in the learning set are drawn from the input space $X$
independently and subject to a uniform distribution. It is also assumed that the desired
response is a function of $x$ given as $d = F(x)$.
Here $F(\cdot)$ can be a deterministic function or a stochastic one. In the latter case

$$d = g(x) + \xi ,$$

where $\xi$ denotes noise.
First, without dealing with the training set, let us analyze the expression when $d = F(x)$
is a deterministic function:

$$E\left( d - Net(x, w) \right)^2 = \int \cdots \int_X \left( d - Net(x, w) \right)^2 p(x_1, \dots, x_n)\, dx_1 \cdots dx_n .$$

Since $p(x_1, \dots, x_n) = \frac{1}{|X|}$ due to the uniformity,

$$\frac{1}{|X|} \int \cdots \int_X \left( d - Net(x, w) \right)^2 dx_1 \cdots dx_n \sim \left\| F(x) - Net(x, w) \right\|_{L^2}^2 .$$
Thus, finding $w_{opt} : \min_{w} E\left( d - Net(x, w) \right)^2$ will minimize the approximation error in the
corresponding $L^2$ space. Allowing stochastic mappings between $d$ and $x$ (e.g.
$d = f(x) + \xi$, with $\xi$ denoting noise), finding $w_{opt} : \min_{w} E\left( d - Net(x, w) \right)^2$ poses a nonlinear regression
problem. In order to solve this, let $E(d \mid x) := \int d \, p(d \mid x)\, dd$. It is useful to recall that

$$E\, E(d \mid x) = \int_X \int d \, p(d \mid x)\, dd \; p(x)\, dx = \int \int_X d \, p(d, x)\, dx_1 \cdots dx_n \, dd = \int d \, p(d)\, dd = E d .$$
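Now (xxx) can be rewritten as

$$E\left( d - Net(x, w) \right)^2 = E\left( d - E(d \mid x) + E(d \mid x) - Net(x, w) \right)^2$$
$$= E\left( d - E(d \mid x) \right)^2 + 2 E\left[ \left( d - E(d \mid x) \right) \left( E(d \mid x) - Net(x, w) \right) \right] + E\left( E(d \mid x) - Net(x, w) \right)^2 .$$

In the term $2 E\left[ \left( d - E(d \mid x) \right) \left( E(d \mid x) - Net(x, w) \right) \right]$ one can use the fact that $d - E(d \mid x)$ and $E(d \mid x) - Net(x, w)$
are uncorrelated: the second factor depends only on $x$, while the first has zero conditional mean given $x$. Therefore

$$2 E\left[ \left( d - E(d \mid x) \right) \left( E(d \mid x) - Net(x, w) \right) \right] = 2 E\left( d - E(d \mid x) \right) E\left( E(d \mid x) - Net(x, w) \right) = 0 ,$$

as $E\left( d - E(d \mid x) \right) = E d - E\, E(d \mid x) = E d - E d = 0$.
Hence

$$E\left( d - Net(x, w) \right)^2 = E\left( d - E(d \mid x) \right)^2 + E\left( E(d \mid x) - Net(x, w) \right)^2 . \qquad (2)$$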
As a result, when $E\left( d - Net(x, w) \right)^2$ is minimized, namely $w_{opt} : \min_{w} E\left( d - Net(x, w) \right)^2$ is
searched, then $Net(x, w_{opt}) = E(d \mid x)$ (provided the architecture is rich enough to represent it), making the second term zero in (2).
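The decomposition (2) can be illustrated with a small Monte Carlo experiment. The setting below is an assumption made for illustration: $x$ uniform on $[0, 1]$, $d = x^2 + \xi$ with Gaussian noise of standard deviation `SIGMA`, so that $E(d \mid x) = x^2$, and a one-parameter model $Net(x, w) = w x$.

```python
import random

SIGMA = 0.1  # hypothetical noise standard deviation

def regression(x):
    return x * x                 # E(d | x) in this toy setting

def net(x, w):
    return w * x                 # hypothetical one-parameter Net(x, w)

def estimate_terms(w, K, rng):
    """Monte Carlo estimates of the total error and the three terms of its expansion."""
    total = noise = approx = cross = 0.0
    for _ in range(K):
        x = rng.random()
        d = regression(x) + rng.gauss(0.0, SIGMA)
        a = d - regression(x)            # d - E(d|x)
        b = regression(x) - net(x, w)    # E(d|x) - Net(x, w)
        total += (d - net(x, w)) ** 2
        noise += a * a
        approx += b * b
        cross += 2.0 * a * b
    return total / K, noise / K, approx / K, cross / K

total, noise, approx, cross = estimate_terms(0.75, 200000, random.Random(0))
print("total error      :", total)
print("noise term       :", noise)
print("approximation    :", approx)
print("cross term (~ 0) :", cross)
```

The cross term should come out close to zero, and the first term does not depend on `w` at all, so only the approximation term can be reduced by training.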
Nevertheless, instead of minimizing $E\left( d - Net(x, w) \right)^2$, we can only minimize the
empirical quadratic error

$$J(w) = \frac{1}{K} \sum_{k=1}^{K} \left( d_k - Net(x_k, w) \right)^2$$

over a training set of size $K$ given as

$$\tau^{(K)} := \{ (x_k, d_k), \; k = 1, \dots, K \} .$$
Denoting $R(w) := E\left( d - Net(x, w) \right)^2$, we want to ascertain the relationship
between $\min_{w} J(w)$ and $\min_{w} R(w)$, or between $R(w_{opt})$ and $J\left( w_{opt}^{(K)} \right)$. Here the notation
$w_{opt}^{(K)}$ is used to indicate that $w_{opt}^{(K)}$ is obtained by minimizing $J(w)$ over a finite
training set of length $K$.
In other words, we wish to find out what size of training set provides enough
information to approximate $R(w_{opt})$.
The bias-variance dilemma
Let us investigate the error $E\left( d - Net\left( x, w_{opt}^{(K)} \right) \right)^2$, where $w_{opt}^{(K)}$ is obtained by
minimizing the empirical error $J(w)$. One can then write

$$E\left( d - Net\left( x, w_{opt}^{(K)} \right) \right)^2 = E\left( d - Net(x, w_{opt}) + Net(x, w_{opt}) - Net\left( x, w_{opt}^{(K)} \right) \right)^2$$
$$= E\left( d - Net(x, w_{opt}) \right)^2 + E\left( Net(x, w_{opt}) - Net\left( x, w_{opt}^{(K)} \right) \right)^2 .$$
Remark:
The other terms in the expression above become zero in a similar manner as was
demonstrated in (xxx).
The first term in the expression above is the approximation error between $F(x)$ and
$Net(x, w)$, whereas the second one is the error resulting from the finite training set.
Thus, one can choose between two options:
• either minimizing the first term (which is referred to as bias) with a
relatively large network, but in this case, with a training set of limited size, the
weights cannot be trained correctly, leaving the second term large;
• or minimizing the second term (called variance), which requires a small
network if the size of the training set is finite, but that leaves the first term large.
In conclusion, there is a dilemma between bias and variance. This gives rise to the
question of how to set the size of the training set so as to strike a good balance between
bias and variance.
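The trade-off can be observed in a small simulation. The sketch below does not use a neural network: as a stand-in assumption, the model is a piecewise-constant regressor with `m` bins, where `m` plays the role of the network size. The target $\sin(2\pi x)$, the noise level and the function names are likewise illustrative choices.

```python
import math
import random

def f(x):
    return math.sin(2.0 * math.pi * x)   # hypothetical regression function

def fit_bins(samples, m):
    """Piecewise-constant model with m bins on [0, 1): per-bin mean of d."""
    sums, counts = [0.0] * m, [0] * m
    for x, d in samples:
        j = min(int(x * m), m - 1)
        sums[j] += d
        counts[j] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def bias_variance(m, K=30, trials=300, noise=0.3, seed=0):
    """Estimate squared bias and variance of the fitted model over many training sets."""
    rng = random.Random(seed)
    grid = [(i + 0.5) / 100 for i in range(100)]
    preds = []                            # preds[t][i]: trial t, grid point i
    for _ in range(trials):
        samples = []
        for _ in range(K):
            x = rng.random()
            samples.append((x, f(x) + rng.gauss(0.0, noise)))
        model = fit_bins(samples, m)
        preds.append([model[min(int(x * m), m - 1)] for x in grid])
    bias2 = var = 0.0
    for i, x in enumerate(grid):
        column = [p[i] for p in preds]
        mean = sum(column) / trials
        bias2 += (mean - f(x)) ** 2
        var += sum((c - mean) ** 2 for c in column) / trials
    return bias2 / len(grid), var / len(grid)

for m in (1, 4, 15):
    print(m, bias_variance(m))
```

With only $K = 30$ samples per training set, the smallest model should show a large squared bias and a small variance, while the largest model should show the opposite pattern.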
The optimal size of the training set: the VC dimension
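From the ergodic hypothesis it follows that

$$\forall w \in W, \; \varepsilon > 0, \; \delta > 0 \;\; \exists K_0 \text{ for which } P\left( \left| R(w) - J^{(K)}(w) \right| > \varepsilon \right) < \delta \text{ if } K > K_0 .$$

Unfortunately this $K_0$ depends on $w$, thus we cannot ascertain a $K_0(\varepsilon, \delta)$ for which it
is known that

$$P\left( \left| R(w_{opt}) - J^{(K)}\left( w_{opt}^{(K)} \right) \right| > \varepsilon \right) = P\left( \left| \min_{w} R(w) - \min_{w} J^{(K)}(w) \right| > \varepsilon \right) < \delta .$$

If such a $K_0(\varepsilon, \delta)$ existed, it would yield the necessary size of the training set.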
To have such a result, we have to introduce a more stringent bound on the convergence,
called uniform convergence, namely

$$\forall \varepsilon > 0, \; \delta > 0 \;\; \exists K_0 \text{ for which } P\left( \sup_{w \in W} \left| R(w) - J^{(K)}(w) \right| > \varepsilon \right) < \delta \text{ if } K > K_0 ,$$

which enforces that for every $w \in W$

$$P\left( \left| R(w) - J^{(K)}(w) \right| > \varepsilon \right) < \delta \text{ if } K > K_0 .$$
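The supremum in the uniform convergence condition can be estimated for a very simple family. As an illustrative assumption, take threshold classifiers that output 1 if $x > w$, inputs uniform on $[0, 1)$ and desired output 1 if $x > 0.5$; then $R(w) = |w - 0.5|$, and $J^{(K)}(w)$ is the fraction of samples falling between $w$ and $0.5$. The supremum is approximated over a grid of thresholds.

```python
import bisect
import random

def sup_deviation(K, rng, grid=101):
    """Largest |R(w) - J(w)| over a grid of thresholds w, for one training set of size K."""
    xs = sorted(rng.random() for _ in range(K))
    worst = 0.0
    for i in range(grid):
        w = i / (grid - 1)
        lo, hi = min(w, 0.5), max(w, 0.5)
        # The classifier 1[x > w] disagrees with the label 1[x > 0.5] exactly on (lo, hi].
        errors = bisect.bisect_right(xs, hi) - bisect.bisect_right(xs, lo)
        empirical = errors / K            # J^{(K)}(w)
        expected = hi - lo                # R(w) under the uniform distribution
        worst = max(worst, abs(expected - empirical))
    return worst

rng = random.Random(0)
for K in (100, 1000, 10000):
    print(K, sup_deviation(K, rng))
```

For this family the worst-case deviation over all thresholds should shrink as $K$ grows, which is what uniform convergence requires; a single $K_0$ then serves every $w$ at once.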
If this uniform convergence holds, then the necessary size of the learning set can be
estimated. Vapnik and Chervonenkis pioneered the work in revealing such bounds, and
the basic parameter of this bound is called the VC dimension to honour their achievements.
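As a sketch of how such an estimate can be obtained, assume a uniform bound of the form $(2eK/h)^h e^{-\varepsilon^2 K}$, where $h$ is the VC dimension; this is the form attributed to Vapnik (1982) later in this section. The function names and the numerical values are illustrative assumptions. Since the bound first increases and then decreases in $K$, the search below is restricted to $K \ge h / \varepsilon^2$, where it is decreasing.

```python
import math

def log_bound(K, h, eps):
    """Logarithm of (2eK/h)^h * exp(-eps^2 K), computed in logs to avoid overflow."""
    return h * math.log(2.0 * math.e * K / h) - eps * eps * K

def min_training_size(h, eps, delta):
    """Smallest K past the bound's peak for which the bound is at most delta."""
    target = math.log(delta)
    lo = max(h, math.ceil(h / (eps * eps)))   # the bound decreases in K from here on
    if log_bound(lo, h, eps) <= target:
        return lo
    hi = 2 * lo
    while log_bound(hi, h, eps) > target:      # double until the bound is small enough
        lo, hi = hi, 2 * hi
    while hi - lo > 1:                         # bisection on the decreasing part
        mid = (lo + hi) // 2
        if log_bound(mid, h, eps) <= target:
            hi = mid
        else:
            lo = mid
    return hi

K0 = min_training_size(h=3, eps=0.1, delta=0.05)
print(K0)
```

The returned value grows with the VC dimension `h` and with tighter `eps` and `delta`, which illustrates why a finite VC dimension is needed for a finite training set size to suffice.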
The VC dimension
Let us assume that we are given a $Net(x, w)$ which we use for binary classification. The
VC dimension is related to the classification power of $Net(x, w)$. More precisely,
given the set of dichotomies expressed by $Net(x, w)$ as

$$\mathcal{F} := \left\{ Net(x, w), \; w \in W : \; Net(x, w) = 1 \text{ if } x \in X_1, \; Net(x, w) = 0 \text{ if } x \in X_0, \; X_0 \cup X_1 = X, \; X_0 \cap X_1 = \emptyset \right\}$$

and given a set of input points

$$\mathcal{S} := \{ x_i, \; i = 1, \dots, N \},$$

consider the number of possible dichotomies expressed by $Net(x, w)$ on $\mathcal{S}$. If all
$2^N$ dichotomies can be expressed by $Net(x, w)$, then we say $\mathcal{S}$ is shattered by $\mathcal{F}$.
The VC dimension of $Net(x, w)$ is defined as the size of the largest set $\mathcal{S}$ that is shattered by $\mathcal{F}$.
E.g. let us consider the following elementary mapping $Net(x, w) = \mathrm{sgn}\{ w^T x + b \}$
generated by a single neuron.
Its VC dimension is $N + 1$, where $N$ here denotes the input dimension: if $N = 2$, at most $2 + 1 = 3$ points can be separated in every possible way on a 2D
plane.
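The single-neuron example can be checked by brute force. The sketch below enumerates every labelling of a point set and tests it with the perceptron learning rule; this rule is used here only as a convenient separability check, and failing to converge within `max_epochs` is treated as "not separable", which is a heuristic assumption rather than a proof. The point sets are illustrative.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Run the perceptron rule; return True if an epoch with zero errors is reached."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * s <= 0:                      # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def count_dichotomies(points):
    """Number of +/-1 labellings of the points realised by sgn(w^T x + b)."""
    return sum(separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = three + [(1.0, 1.0)]
print(count_dichotomies(three), "of", 2 ** len(three))
print(count_dichotomies(four), "of", 2 ** len(four))
```

The three points are shattered, whereas the four corners of the unit square are not, because the XOR labelling and its complement are not linearly separable.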
Distribution-free bounds on the convergence rate over the training set
Let us choose the error function as an indicator function

$$l\left( d, Net(x, w) \right) = \begin{cases} 0 & \text{if } Net(x, w) = d \\ 1 & \text{otherwise.} \end{cases}$$

Then $E\, l\left( d, Net(x, w) \right) = P_{error}(w)$,
whereas the empirical error

$$J^{(K)}(w) = \frac{1}{K} \sum_{k=1}^{K} l\left( d_k, Net(x_k, w) \right) = \nu(w)$$

is the relative
frequency (the error rate) $\nu(w)$.
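With the indicator error function, the empirical error is simply the fraction of misclassified training samples. A minimal sketch, assuming a hypothetical threshold classifier and a made-up five-element training set:

```python
def indicator_loss(d, y):
    """0 if the network output y equals the desired output d, 1 otherwise."""
    return 0 if y == d else 1

def net(x, w):
    """Hypothetical threshold classifier: outputs 1 if x > w, else 0."""
    return 1 if x > w else 0

def error_rate(samples, w):
    """Relative frequency of errors over the training set."""
    return sum(indicator_loss(d, net(x, w)) for x, d in samples) / len(samples)

# Made-up training set of (x_k, d_k) pairs.
samples = [(0.1, 0), (0.3, 0), (0.45, 1), (0.6, 1), (0.9, 1)]
print(error_rate(samples, 0.5))   # the sample at x = 0.45 is misclassified
print(error_rate(samples, 0.4))   # every sample is classified correctly
```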
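Then (Vapnik, 1982) the following bound holds:

$$P\left( \sup_{w} \left| P_{error}(w) - \nu(w) \right| > \varepsilon \right) < \left( \frac{2 e K}{VC} \right)^{VC} e^{-\varepsilon^2 K} ,$$

where $\nu(w)$ is the empirical error rate and $VC$ is the VC dimension of $Net(x, w)$.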
This guarantees uniform convergence: by setting

$$\delta = \left( \frac{2 e K_0}{VC} \right)^{VC} e^{-\varepsilon^2 K_0} ,$$

$K_0(\varepsilon, \delta)$ is implicitly given. One must note that $K_0$ depends on the VC dimension. If
the VC dimension is finite then