
On Approximate Learning by Multi-layered Feedforward Circuits

Bhaskar DasGupta

Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607-7053, USA

Barbara Hammer
Department of Mathematics/Computer Science, University of Osnabrück, Germany

Abstract

We deal with the problem of efficient learning of feedforward neural networks. First, we consider the objective to maximize the ratio of correctly classified points compared to the size of the training set. We show that it is NP-hard to approximate this ratio within some constant relative error if architectures with varying input dimension, one hidden layer, and two hidden neurons are considered, where the activation function in the hidden layer is the sigmoid function and the situation of epsilon-separation is assumed, or the activation function is the semilinear function. For single hidden layer threshold networks with varying input dimension and k hidden neurons, approximation within a relative error depending on k is NP-hard even if restricted to situations where the number of examples is limited with respect to k. Afterwards, we consider the objective to minimize the failure ratio in the presence of misclassification errors. We show that it is NP-hard to approximate the failure ratio within any positive constant for a multilayered threshold network with varying input dimension and a fixed number of neurons in the hidden layer if the thresholds of the neurons in the first hidden layer are zero. Furthermore, even obtaining weak approximations is almost NP-hard in the same situation.

Key words: Neural Networks, Loading Problem, NP-hardness, Approximation
ACM F.1.1, F.1.3; MSC 68Q17

Preprint submitted to Elsevier Science

An extended abstract of this paper appeared in International Conference on Algorithmic Learning Theory, December 2000, pp. 264-278. Email addresses: dasgupta@cs.uic.edu (Bhaskar DasGupta), hammer@informatik.uni-osnabrueck.de (Barbara Hammer). Research supported by NSF grants CCR-0296041, CCR-0208749 and CCR-0206795.

18 February 2003

1 Introduction

Neural networks are a well established learning mechanism which offers a very simple method of learning an unknown hypothesis when some examples are given. In addition to their success in various areas of application, the possibility of massive parallelism and their noise and fault tolerance are offered as a justification for their use. Commonly, they are trained very successfully with some modification of the backpropagation algorithm [36]. However, the inherent complexity of training neural networks is still an open problem for almost all practically relevant situations. In practice, a large variety of tricks and modifications of the representation of the data, the neural architecture, or the training algorithm is applied in order to obtain good results [29]; the methods are mostly based on heuristics. From a theoretical point of view, the following question is not yet answered satisfactorily: in which situations is training tractable or, conversely, does it require a large amount of time? Until now it is only known that training a fixed network, as it appears in practice, is at least decidable assuming that the so-called Schanuel conjecture holds [25]. In other words, till now it is only proved (up to some conjecture in pure mathematics) that training of standard neural networks can be performed on a computer in principle, but no bounds on the required amount of time have been derived in general. People have to rely on heuristics in order to design the training problems to ensure that training a neural network succeeds. In order to obtain theoretical guarantees and hints about which situations may cause trouble, researchers have turned to simpler situations in which theoretical results can be obtained. The main purpose of this paper is to consider situations which are closer to the training problems as they occur in practice.

In order to state the problems we are interested in, consider a standard feedforward neural network. Such a network consists of neurons connected in a directed acyclic graph. The overall behavior is determined by the architecture A and the network parameters w. Given a pattern set P, i.e. a collection of points or training examples x_i and their labelings y_i, we want to learn the regularity consistent with the mapping of the points x_i to the labels y_i with such a network. Frequently, this is performed by first choosing an architecture A which computes a function β_A(w, x) depending on the parameters w. In a second step, the parameters w are chosen such that β_A(w, x_i) = y_i holds for every training pattern (x_i, y_i). The loading problem is to find weights w for A such that these equalities hold for every pattern in P. The decision version of the loading problem is to decide (rather than to find the weights) whether such weights exist that load P onto A. Obviously, finding optimal weights is at least as hard as the decision version of the loading problem.

We will refer to both the point x_i and the pair (x_i, y_i), consisting of the point together with its labeling, as a point or a training example. Note that an example may occur more than once in P, i.e. the multiplicity of a point may be larger than 1.
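
It is useful to keep in mind how simple the underlying check is: given concrete weights, deciding whether they load a training set is a single pass over the examples. The following minimal Python sketch (with names of our own choosing; the network function is defined formally in Section 2) illustrates the decision version:

    def loads(beta, w, samples):
        # do the weights w fit every (point, label) pair?
        return all(beta(w, x) == y for x, y in samples)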


Researchers have considered specific architectures for neural nets in which the so-called activation function in the architecture is the threshold activation function, a particularly simple function. This function captures the asymptotic behavior of many common activation functions including the sigmoidal function, but it does not share other properties (such as differentiability). It has been shown that for every fixed threshold architecture training can be performed in polynomial time [14,16,26]. Starting with the work of Judd [20], researchers have considered situations where only architectural parameters are allowed to vary from one training instance to the next training instance in order to take into account that most existing training algorithms are uniform with respect to the architecture. That implies that common learning algorithms do not rely on the number of neurons which are considered in the specific setting. Hence the complexity of the training problem should scale well with respect to the number of neurons. It turns out that the training problem is NP-hard in several situations, i.e., the respective problems are infeasible (under the standard complexity-theoretic assumption P ≠ NP [15]): Blum and Rivest [9] showed that a varying input dimension yields the NP-hardness of the training problem for architectures with only two hidden neurons using the threshold activation function. The approaches in [16,28] generalize this result to multilayered threshold networks. Investigations have been carried out to get around this boundary of NP-hardness of the training problem by using activation functions different from the threshold activation function. In fact, for some strange activation functions (which are not likely to be used in practice at all) or a setting where the number of examples and the number of hidden neurons coincide, loadability is trivial [32]. References [14,17,19,31,35] constitute approaches to generalize the NP-hardness result of Blum and Rivest to architectures with a continuous or the standard sigmoidal activation function. Hence finding an optimum weight setting in a concrete learning task captured by the above settings may require a large amount of time.

However, most works in this area deal either with only very restricted architectures (e.g. only three neurons), an activation function not used in practice (e.g. the threshold function), or, generally, a training problem which is, in some sense, too strict compared to practical training situations. Naturally, the constraint of the loading problem that all the examples must be satisfied is too strict. In a practical situation, one would be satisfied if a large fraction (but not necessarily all) of the examples can be loaded. Moreover, in the context of agnostic learning [34], a situation in which the neural architecture may not model the underlying regularity precisely, it may be possible that there are no choices for the weights that load a given set of examples. In structural complexity theory, it is common to investigate the possibility of proving NP-hardness of the decision versions of general optimization problems [15] and, moreover, the possibility of designing approximation algorithms for the original optimization problem together with guarantees on the quality of the approximate solutions returned by such an algorithm [13,15]. A list of results concerning the complexity and approximability of various problems can be found, for example, in [3], where it can be seen that there are problems which can be approximated to within a high accuracy in polynomial time even though the problem itself is NP-complete.

From these motivations, researchers have considered a modified version of the loading problem where the number of correctly classified points is to be maximized. References [1,2,18] consider the complexity of training single neurons with the threshold activation function within some error ratio. The authors of [4] deal with multilayered threshold networks. Formally, the maximization version of the loading problem (e.g., see [4]) deals with a success ratio function which is to be maximized: it computes the number of points in the training set (counted with their multiplicities) which the network maps to the correct output, divided by the size of the training set. That is, we would like to satisfy the largest possible fraction of the given collection of examples. We will consider this objective first and obtain NP-hardness results for approximately minimizing the relative error of the success ratio which deal with more realistic activation functions and situations compared to [4]. In the second part of the paper, we consider another possible function for minimization which is used in [2] and which is more relevant in the context where a significant number of examples are not classified correctly. This is called the failure ratio of the network, i.e., the ratio of the number of misclassifications (again counted with their multiplicities) produced by the learning algorithm to that of an optimal algorithm. We will obtain results of NP-hardness or almost NP-hardness, respectively, of approximating this failure ratio for multilayered threshold networks in the second part of the paper.

The organization of the rest of the paper is as follows: First, in Section 2 we define the basic model and notations. Next, we consider the complexity of minimizing the relative error of the success ratio function of a neural network within some error bound. For this purpose, following the approach in [4], we show in Section 3.1 that a certain type of reduction from the MAX-k-cut problem to the loading problem constitutes an L-reduction, thereby preserving the NP-hardness of approximation. Afterwards we apply this method to several situations as stated below. We show, as already shown in [4] except for one boundary case, that it is NP-hard to approximate the success ratio within some constant relative error for multilayer threshold networks with varying input dimension and a fixed number of neurons in the first hidden layer. In Section 3.2.1, we show that, for architectures with one hidden layer and two hidden neurons (the classical case considered by [9]), approximation of the success ratio with relative error smaller than some constant is NP-hard even if either (a) the threshold activation function in the hidden layer is substituted by the classical sigmoid function and the situation of ε-separation at the output is considered, or (b) the threshold activation function is substituted by the semilinear activation function commonly used in the neural net literature (e.g., see [6,11,14,22]). As in [4], the above reductions use example sets where some of the examples occur more than once. In Section 3.3, we discuss how these multiplicities can be avoided. In Section 3.4, we consider the situation where the number of examples is restricted with respect to the number of hidden neurons, and show that for a single hidden layer threshold network with varying input dimension and k hidden neurons, approximating the success ratio within a relative error

smaller than some constant depending on k is NP-hard even if restricted to situations where the number of examples is bounded with respect to k. In the remaining part of the paper, we consider the objective to minimize the failure ratio in the presence of misclassification errors (e.g., see [1,2]) and show that it is NP-hard to approximate this failure ratio within any positive constant for a multilayered threshold network with varying input dimension and a fixed number of neurons in the first hidden layer if the thresholds of the neurons in the first hidden layer are zero. Assuming a conjecture in structural complexity theory, namely that NP is not contained in DTIME(n^{poly(log n)}) [3] (hardness under this assumption is referred to as almost NP-hardness in the literature, e.g., see [2]), we show that approximating the failure ratio in the presence of errors for such a multilayered threshold network with varying input dimension and a fixed number of neurons in the first hidden layer, in which the thresholds of the neurons in the first hidden layer are fixed to zero, within a factor that grows with the varying number of input neurons in the respective instance (parameterized by an arbitrary fixed constant) is not possible in polynomial time. Finally, we conclude in Section 5 with some open problems worth investigating further.

2 The Basic Model and Notations

The architecture of a feedforward net is described by a directed acyclic interconnection graph and the activation functions of the neurons. A neuron (processor or node) v of the network computes a function

γ(w1 x1 + ... + wk xk + θ)

of its inputs x1, ..., xk. The term w1 x1 + ... + wk xk + θ is called the activation of the neuron v. The inputs are either external (i.e., representing the input data) or internal (i.e., representing the outputs of the immediate predecessors of v). The coefficients w1, ..., wk (resp. θ) are the weights (resp. threshold) of neuron v, and the function γ is the activation function of v. The output of a designated neuron provides the output of the network. An architecture specifies the interconnection graph and the activation function of each neuron, but not the actual numerical values of the weights or thresholds. The depth of a feedforward net is the length (number of neurons) of the longest path in the acyclic interconnection graph. The depth of a neuron is the length of the longest path in the graph which ends in that neuron. A layered feedforward neural net is one in which neurons at depth d are connected only to neurons at depth d + 1, and all inputs are provided to neurons at depth 1 only. A layered (n, n1, ..., nh)-net is a layered net with ni neurons at depth i, for 1 ≤ i ≤ h, where n is the number of inputs. Note that we assume nh = 1 in the following.

Nodes at depth i, for 1 ≤ i ≤ h − 1, are called hidden neurons, and all neurons at depth i, for a particular i with 1 ≤ i ≤ h − 1, constitute the i-th hidden layer. For simplicity, we will sometimes refer to the inputs as input neurons. To emphasize the selection of activation functions we introduce the concept of Γ-nets for a class Γ of activation functions. A Γ-net is a feedforward neural net in which only functions in Γ are assigned to neurons. We assume that each function in Γ is defined on some subset of the real numbers. Hence each architecture A of a Γ-net defines a behavior function β_A that maps from the real weights w (corresponding to all the weights and thresholds of the underlying directed acyclic graph) and the inputs x into an output value; we denote such a behavior as the function β_A(w, x). Some popular choices of the activation functions are the threshold activation function H,
H(x) = 1 if x ≥ 0, and H(x) = 0 otherwise,

and the standard sigmoid

sgd(x) = 1 / (1 + e^{−x}).

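As a minimal illustration of these definitions (a sketch with ad-hoc names, not tied to any particular result below), the behavior function of a layered (n, 2, 1)-net with threshold or sigmoidal hidden units can be written directly:

    import math

    def H(x):
        # threshold activation
        return 1.0 if x >= 0 else 0.0

    def sgd(x):
        # standard sigmoid activation
        return 1.0 / (1.0 + math.exp(-x))

    def beta_n21(w, x, act=H):
        # behavior of a layered (n, 2, 1)-net; w bundles all weights and
        # thresholds: weight vectors a, b and thresholds of the two hidden
        # neurons, plus weights w1, w2 and threshold of the output neuron
        a, theta_a, b, theta_b, w1, w2, theta = w
        h1 = act(sum(ai * xi for ai, xi in zip(a, x)) + theta_a)
        h2 = act(sum(bi * xi for bi, xi in zip(b, x)) + theta_b)
        return H(w1 * h1 + w2 * h2 + theta)
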
In the learning context, the loading problem (e.g., see [14]) is defined as follows: Given an architecture A and a collection P of training examples (x, y) (we allow multiplicities, i.e., an example may be contained more than once in a training set), find weights w so that for all pairs (x, y) in P:

β_A(w, x) = y.

Note that both the architecture and the training set are part of the input in general. In this paper we will deal with classification tasks, i.e. y ∈ {0, 1} instead of arbitrary real-valued outputs. Clearly, the NP-hardness results obtained with this restriction will be valid in the unrestricted case also. An example (x, y) is called a positive example if y = 1, otherwise it is called a negative example. An example is misclassified by the network if β_A(w, x) ≠ y, otherwise it is classified correctly.

In general, a maximization (minimization) problem is characterized by a nonnegative cost function c(I, s), where I is an input instance of the problem, s is a solution for I, and c(I, s) is the cost of the solution s; the goal of such a problem is to maximize (minimize) c(I, s) for any particular I. Denote by opt(I) (or simply by opt if the problem is clear from the context) the maximum (minimum) value of c(I, s). The two objectives that are of relevance to this paper are as follows (assume that I is the instance (architecture and training set) and s is a solution (values of weights) for I):

Success ratio function m_A:

m_A(w, P) = (number of examples (x, y) in P such that β_A(w, x) = y) / |P|
(e.g., see [4]). Note that the examples are counted with multiplicities if they are contained in P more than once. In other words, m_A is the fraction of the correctly classified points compared to all points in a training set. Notice that 0 ≤ opt(I) ≤ 1 holds for all instances I. The relative error of a solution w is the quantity (opt(I) − m_A(w, P)) / opt(I). Our interest in this paper lies in investigating the complexity of finding a solution such that the relative error is bounded from above by some constant.

Failure ratio function m_f: Define e_A(w, P) as the number of examples (x, y) in P (counted with multiplicities) such that β_A(w, x) ≠ y. Then, provided at least one misclassification is unavoidable,

m_f(w, P) = e_A(w, P) / min_{w'} e_A(w', P)
(e.g., see [2]). That is, m_f is the ratio of the number of points misclassified by the given network to the minimum possible number of misclassifications when at least one misclassification is unavoidable. Our interest in this paper lies in investigating the complexity of finding a solution such that m_f is smaller than some value. The NP-hardness of approximately optimizing these objectives will be the topic of this paper. For convenience, we repeat the formal definition of maximization and minimization problems as introduced above and the notion of NP-hardness within this context:

Definition 1 A maximization or minimization problem, respectively, consists of a set of instances, a set of possible solutions for each instance, and a cost function c which assigns a positive real number c(I, s) to each instance I and solution s, the cost of the solution s for the instance I of the problem. We assume that for each instance a solution with optimum, i.e. maximum or minimum value, respectively, exists. Denote by opt(I) the respective optimum value, i.e. opt(I) = max_s c(I, s) if the problem is a maximization problem and opt(I) = min_s c(I, s) if it is a minimization problem, respectively. Assume ε > 0 is some constant. Then approximating the relative error of the maximization problem within the constant ε is NP-hard if every problem in NP can be reduced in polynomial time to the following problem: given an instance I of the problem, find a solution s such that the relative error is limited by (opt(I) − c(I, s)) / opt(I) ≤ ε.

Assume ε is a constant. Assume the problem is a minimization problem where, by definition of the cost function, opt(I) ≥ 1 holds for all instances I. Then approximation of the relative cost of the minimization problem within the constant ε is NP-hard if every problem in NP can be reduced in polynomial time to the following problem: given an instance I of the problem, find a solution s such that the cost can be limited by c(I, s) ≤ ε · opt(I).

In our case instances are given by neural architectures and training sets and solutions are possible choices of the weights of the neural architecture. As already mentioned, we will deal with the following two objectives: minimizing the failure ratio function m_f and minimizing the relative error of the success ratio function m_A, respectively. Note that both objectives are defined via the cost function m_A or m_f, respectively, of the underlying maximization or minimization problem. We will in the following refer to the above notion of NP-hardness as the NP-hardness of approximating m_A or m_f, respectively, or of approximating the respective cost.

Depending on the minimum number of misclassifications that are unavoidable in a training scenario, the two objectives we are interested in can be related. Assume an input instance of a training scenario and a solution are given. Denote by p the size of the given training set, by p_opt the maximum number of points which can be classified correctly, and assume p_opt < p. Assume the number of points classified correctly by the solution is p_c. Then the two objectives can be related to each other as demonstrated below:

Assume that the relative error of the success ratio function is smaller than some value ε, i.e. (p_opt − p_c)/p_opt < ε. As a consequence, the failure ratio can be limited by (p − p_c)/(p − p_opt) < 1 + ε · p_opt/(p − p_opt). If a large number of errors is unavoidable, i.e. p_opt is much smaller than p, the term ε · p_opt/(p − p_opt) is small; in this case bounds on the relative error of the success ratio can be transformed into small bounds on the failure ratio function. Conversely, bounds on the relative error of the success ratio lead to only very weak bounds on the failure ratio function if only a small number of points is necessarily misclassified and p_opt approaches p, since then the factor p_opt/(p − p_opt) is very large.

Assume conversely that the failure ratio is limited by some value μ, i.e. (p − p_c)/(p − p_opt) ≤ μ. Then we can bound the relative error of the success ratio by the inequality (p_opt − p_c)/p_opt ≤ (μ − 1)(p − p_opt)/p_opt. This value is small if p_opt is close to p and it is large if p_opt is much smaller than p. Hence we obtain small bounds on the relative error of the success ratio if the number of unavoidable misclassifications is small. We obtain only weak bounds from the above argument if the number of unavoidable misclassifications is large.
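
The conversion between the two error measures can be checked on concrete numbers; the following sketch (ad-hoc names, purely illustrative) reproduces the first bound above:

    def relative_error(p_c, p_opt):
        return (p_opt - p_c) / p_opt

    def failure_ratio(p_c, p, p_opt):
        # only defined when at least one misclassification is unavoidable
        assert p_opt < p
        return (p - p_c) / (p - p_opt)

    # toy numbers: 100 examples, at best 80 fit simultaneously, a solution fits 72
    p, p_opt, p_c = 100, 80, 72
    eps = relative_error(p_c, p_opt)              # 0.1
    bound = 1 + eps * p_opt / (p - p_opt)         # 1.4
    print(failure_ratio(p_c, p, p_opt), "<=", bound)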

In this paper, we will consider the complexity of finding approximate optima for these two functions. Note, however, that training sets for neural network architectures have been defined above over the real numbers. We will in the following restrict ourselves to representations over the rational numbers only and we will assume that the numbers are represented in the standard way. Note that there exist alternative notions of computation over the real numbers which will not be the subject of this article [10].

3 Hardness of Approximating the Success Ratio Function

We want to show that in several situations it is difficult to approximately minimize the relative error of the success ratio m_A for a loading problem (A, P). These results extend and generalize the results of Bartlett and Ben-David [4] to more complex activation functions and several realistic situations.

3.1 A General Theorem

First, an L-reduction from the so-called MAX-k-cut problem to the loading problem is constructed. This reduction shows the NP-hardness of approximability of the latter problem since it is known that approximating the MAX-k-cut problem is NP-hard and an L-reduction preserves approximability. Formally, the MAX-k-cut problem is defined as follows.

Definition 2 Given an undirected graph G = (V, E) and a positive integer k ≥ 2, the MAX-k-cut problem is to find a function f: V → {1, ..., k} such that the ratio |{(u, v) ∈ E : f(u) ≠ f(v)}| / |E| is maximized. The set of nodes in V which are mapped to i in this setting is called the i-th cut. The edges (u, v) in the graph for which u and v are contained in the same cut are called monochromatic; all other edges are called bichromatic.

Theorem 3 [21] It is NP-hard to approximate the MAX-k-cut problem within relative error smaller than some positive constant (depending on k) for any k ≥ 2.

The concept of an L-reduction was defined by Papadimitriou and Yannakakis [27]. The definition stated below is a slightly modified version of the one in [27] (allowing an additional parameter) that will be useful for our purposes.

Definition 4 An L-reduction from a maximization problem Π1 to a maximization problem Π2 consists of two polynomial time computable functions m1 and m2, two constants α, β > 0, and a parameter ε0 > 0 with the following properties:

(a) For each instance I of Π1, algorithm m1 produces an instance m1(I) of Π2.
(b) The maxima of I and m1(I), opt(I) and opt(m1(I)), respectively, satisfy opt(m1(I)) ≤ α · opt(I).
(c) Given any solution of the instance m1(I) of Π2 with cost c2 such that the relative error of this solution is smaller than ε0, algorithm m2 produces in polynomial time a solution of I with cost c1 satisfying opt(I) − c1 ≤ β · (opt(m1(I)) − c2).

The following observation is easy; it follows by chaining the inequalities of properties (b) and (c).

Observation 5 Assume that Π1 L-reduces to Π2 with constants α, β and parameter ε0. Then, if approximation of Π1 with relative error smaller than ε is NP-hard, then approximation of Π2 with relative error smaller than min{ε0, ε/(αβ)} is also NP-hard.

Since the reductions for various types of network architectures are very similar, we first state a general theorem following the approach in [4]. For a vector x = (x1, ..., xn), xi occupies position i of x and is referred to as the i-th component (or, component i) of x in the following discussions.

Consider an L-reduction from the MAX-k-cut problem to the loading problem with success ratio function m_A satisfying the following additional properties. Given an instance I of the MAX-k-cut problem, assume that m1 produces in polynomial time an instance m1(I) (a specific architecture A and a training set P whose size is polynomial in the size of I) of the loading problem where the points are of the following form: a fixed number of copies of each of some set S of special points (e.g. the origin); for each node v_i, d_i copies of one point p_i, where d_i is the degree of v_i; and, for each edge (v_i, v_j), one point p_ij.

Furthermore, assume that the reduction performed by m1 and m2 also satisfies the following properties:
(i) For an optimum solution of I we can find, using algorithm m2, an optimum solution of the instance m1(I) of the corresponding loading problem in which all special points and all points p_i are correctly classified and exactly those points p_ij are misclassified which correspond to a monochromatic edge in an optimal solution of I.
(ii) For any approximate solution of the instance m1(I) of the loading problem which classifies all special points in the set S correctly, we can use the algorithm m2 to compute in polynomial time an approximate solution of the instance I of the MAX-k-cut problem such that for every monochromatic edge (v_i, v_j) in this solution, either p_i, p_j, or p_ij is misclassified.

Theorem 6 The reduction described above is an L-reduction with constants α and β and a parameter ε0 which can be chosen as computed in the proof below.

PROOF. Let opt(I) and opt(m1(I)) be the optimal values of the MAX-k-cut instance I and of the success ratio function for the loading instance m1(I), respectively. Remember that it is trivially true that opt(m1(I)) ≤ 1; because of (i), an optimum of the loading instance can moreover be related to an optimum k-cut of I.

Consider a very simple randomized algorithm in which a vertex is placed in any one of the k partitions with equal probability. Then, the expected value of the ratio of bichromatic edges is (k − 1)/k. Hence, opt(I) is at least (k − 1)/k. Combining this with opt(m1(I)) ≤ 1 we obtain opt(m1(I)) ≤ 1 ≤ k/(k − 1) · opt(I), so that the constant α of property (b) can be chosen as k/(k − 1).

Next we show that β can be chosen appropriately, provided the relative error of a given approximate solution is smaller than a suitable parameter ε0. Assume that we are given an approximate solution of the instance m1(I) of the loading problem with cost c2 and with relative error smaller than ε0. Then c2 > (1 − ε0) · opt(m1(I)) due to the definition of the relative error. For a suitable choice of ε0, which depends on the number of copies of the special points in the training set, such a solution must classify all special points from the set S correctly: assume, for the sake of contradiction, that this were not true; then all copies of at least one special point would be misclassified, which would push the cost below (1 − ε0) · opt(m1(I)), a contradiction.

If all the special points from the set S are classified correctly, then, by (ii), algorithm m2 computes in polynomial time a solution of the MAX-k-cut instance I such that for every monochromatic edge (v_i, v_j) of this solution either p_i, p_j, or p_ij is misclassified. Since the point p_i occurs with multiplicity d_i and a node v_i is incident to at most d_i monochromatic edges, the number of monochromatic edges is at most the number of misclassified training points (counted with multiplicities). Together with the relation between |P| and the number of edges, this allows one to bound opt(I) − c1 by a constant multiple of opt(m1(I)) − c2, where c1 denotes the cost of the constructed solution of I; this yields the constant β of property (c).
Corollary 7 Assume that a reduction as described above which fulfills properties (i) and (ii) can be found, and assume that approximation of the MAX-k-cut problem within relative error smaller than ε is NP-hard. Then approximation of the loading problem within relative error smaller than min{ε0, ε/(αβ)}, with α, β, and ε0 as in Theorem 6, is NP-hard.

3.2 Application to Neural Networks

This result can be applied to layered H-nets directly, H being the threshold activation function, as already shown in [4]. This type of architecture is common in theoretical studies of neural nets as well as in applications. As defined in Section 2, the architecture is denoted by the tuple (n, n1, ..., nh). One can obtain the following hardness result, which can also be found in [4] except for one boundary case of the first hidden layer size; since that case is not covered in [4], we provide a proof for it in the appendix.

Theorem 8 For any constant ε > 0 and any fixed n1 ≥ 2, n2, ..., nh, it is NP-hard to approximate the loading problem with instances (A, P), where A is the architecture of a layered (n, n1, ..., nh) H-net (n1, ..., nh is fixed, n may vary) and P is a set of examples from Q^n × {0, 1}, with relative error smaller than some constant which depends on n1 and ε.

PROOF. Obviously, since at least one solution which classifies all special points correctly exists, every optimum solution must correctly classify the special points. Hence an upper bound for the constant α and a value for β from Theorem 6 can be estimated via the size of the constructed training set and the lower bound opt(I) ≥ (k − 1)/k for the underlying MAX-k-cut instance. Hence, using Theorem 6 and Observation 5, the result follows.

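The layered H-nets of Theorem 8 are plain feedforward threshold circuits. A compact way to evaluate such an architecture (a sketch with ad-hoc names; the layer encoding is our own) is:

    def H(x):
        return 1.0 if x >= 0 else 0.0

    def layered_H_net(layers, x):
        # layers: one list per non-input layer, each containing a
        # (weight_vector, threshold) pair per neuron; the last layer has one neuron
        values = list(x)
        for layer in layers:
            values = [H(sum(w * v for w, v in zip(weights, values)) + theta)
                      for weights, theta in layer]
        return values[0]

    # a (2, 2, 1) H-net computing XOR
    xor_net = [
        [([1.0, 1.0], -0.5), ([-1.0, -1.0], 1.5)],  # hidden layer
        [([1.0, 1.0], -1.5)],                       # output neuron
    ]
    print([layered_H_net(xor_net, x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])])

On the four Boolean inputs this prints the XOR labels [0.0, 1.0, 1.0, 0.0], illustrating that already a (2, 2, 1) H-net realizes classifications that no single threshold unit can.
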
3.2.1 The (n, 2, 1)-net with sigmoidal activation and H_ε output

The previous theorem deals with multilayer threshold networks which are common in theoretical studies. However, often a continuous and differentiable activation function, instead of the threshold activation function, is used in practical applications. One very common activation function is the sigmoidal activation sgd. Therefore it would be of interest to obtain a result for the sigmoidal activation function as well. In this section we deal with a feedforward architecture of the form (n, 2, 1) where the input dimension n is allowed to vary from one instance to the next instance (this is the same architecture as used in [9]). The activation function of the two hidden neurons is the sigmoidal activation function. Since the network is used for classification purposes, the output activation function is the following modification of the threshold activation function:

H_ε(x) = H(x) if |x| ≥ ε, and H_ε(x) is undefined otherwise.

This modification enforces that any classification is performed with a minimum separation accuracy ε. It is necessary to restrict the output weights, too, since otherwise any separation accuracy could be obtained by an appropriate scaling of the output weights. Therefore we restrict to solutions with output weights bounded by some constant B (we term this the weight restriction of the output weights). This setting is captured by the notion of so-called ε-separation of the outputs (for example, see [24]). Formally, the network computes the function

β_A(w, x) = H_ε(w1 · sgd(a · x + θ1) + w2 · sgd(b · x + θ2) + θ),

where w1, w2, θ and a, θ1, b, θ2 are the weights and thresholds, respectively, of the output neuron and the two hidden neurons, and |w1|, |w2| ≤ B for some positive constant B. Since we deal only with examples whose outputs lie in {0, 1}, the absolute value of the activation of the output neuron has to be larger than ε for every pattern in the training set which is mapped correctly.

Theorem 9 It is NP-hard to approximate the loading problem with relative error smaller than some positive constant for the architecture of a (n, 2, 1)-net with sigmoidal activation function for the two hidden neurons, activation function H_ε in the output neuron (for suitable ε > 0), weight restriction of the output weights (|w1|, |w2| ≤ B), and examples from Q^n × {0, 1}.

PROOF. We use Theorem 6 and Corollary 7. The various steps in the proof are as follows:
(1) Definition of the training points: m1 maps an instance of the MAX-2-cut problem with nodes v_1, ..., v_n and edges (v_i, v_j) to the following loading problem. The input dimension is n plus a fixed number of additional components which are used only by the special points. The points together with their labelings are: a set S of special points which are 0 in the first n components and whose remaining components realize the configurations depicted in Fig. 4 and Fig. 6; for each node v_i, d_i copies of the point p_i, where p_i is the unit vector with entry 1 at position i and d_i is the degree of v_i, with an appropriate labeling; and, for each edge (v_i, v_j), the point p_ij which has entries 1 at positions i and j (counted from the left) and 0 otherwise, with the opposite labeling.

(2) Examination of the geometric form: First we want to see what a classification looks like. The output neuron computes the activation

w1 · sgd(a · x + θ1) + w2 · sgd(b · x + θ2) + θ,

where a and b, respectively, are the weight vectors of the two hidden neurons, θ1 and θ2 are their respective thresholds, and w1, w2, and θ are the weights and threshold of the output neuron. First we investigate the geometric form of the output of our network when the relative error is smaller than the stated bound. We will use properties of the geometric form which are not affected by the concrete parameterization of the objects; in particular, we may scale, rotate, or translate the coordinates. (Note that we will not use the geometric form for the construction of any algorithms but only in proofs.) For this purpose we examine the set M of points which form the classification boundary, i.e. the points x for which

w1 · sgd(a · x + θ1) + w2 · sgd(b · x + θ2) + θ = 0
holds. The set of points for which the left-hand side of the equation above is positive (resp. negative) is called the positive (resp. negative) region. If , , , or were , the output of the network would reduce to at most one hyperplane which separates the points. Obviously, would not be classied correctly in this case. Hence we will assume in the following investigation of the geometric form that these values do not vanish. Since we are only interested in the geometric form, we can substitute the current Euclidean coordinates by any coordinates which are obtained via a translation, rotation, or uniform scaling. Hence we can assume that , where is the rst component of . Assume that and are linearly dependent. Then the classication of an input depends on the value of . Due to our parameterization, only the size of the rst component of a point , the value , determines whether is contained in the positive region, the negative region, or the set of points with output activation of the network. Moreover, . Depending on the weights in the network, the positive region is separated from negative region by up to three parallel hyperplanes of the form where is a solution of the equality and denotes the vector space of vectors which are orthogonal to . In order to show this claim, we have to show that the above function yields zero for at most three points . Remember that we assumed that , , and do not vanish. Assume for the sake of contradiction that the function equals for four points . These points fulll the equality . Since the mapping is monotonic, we can identify a connected open interval for where is contained in and hence the above equality can possibly be solved. Within this interval we can consider the function which then equals the identity for four points. Hence the derivative of the function equals for at least three different points within this interval. Using the equality for we can compute the derivative as . If we set this term as , we obtain a quadratic equation for the term and hence at most two different values such that the derivative equals , hence at most values where the above function equals . On the other hand, if and are linearly independent then forms an dimensional manifold because it is dened as zero set of a function where the Jacobian has full rank. For a denition of manifolds and related terms such as the tangential bundle, curves, and convexity see e.g. [12,33]. The manifold has a very specic form: The tangential space for a point consists of the directions which are orthogonal to the Jacobian at point of the above function. Hence the vector space is always included in the tangential space where as above denotes the vectors which are perpendicular to and denotes the vectors perpendicular to . If a point is contained in then every point for is contained in , too. Note that every vector 15

A A

can be uniquely decomposed into where denotes a vector in the plane spanned by and , and denotes a vector in the orthogonal space . Note that only the rst part, determines whether is contained in , the positive region, or the negative region. Hence we can entirely describe the classication given by the network if we only consider the projection of the points to the plane spanned by and . Hence we can in the following restrict our investigation to the one-dimensional manifold which is obtained if we project onto the plane spanned by and . This one-dimensional manifold entirely determines the classication of points by the network. Next we show that this projection is a simple curve and we derive an explicit parameterization for the curve and a normal vector eld to the curve (i.e. a vector eld of vectors which are orthogonal to the respective tangent.) Assume is an element of the one-dimensional manifold. As above we assume a parameterization of the manifold such that . If is chosen then we can uniquely determine because of where this value is dened, i.e. for if , or if , respectively. Hence we nd the parameterization of for where and where we set if and if . Note that the components of this mapping are both monotonic functions. From we obtain the parameterization of the curve. We refer to this curve as the curve which describes . In particular, this constitutes a simple connected curve parameterized by because the mapping is continuous and the function is obviously injective. A normal vector along the curve can be parameterized by , the Jacobian. The term can again be substituted using the equality . We obtain
Bc ) $  % tr z f { $  % u z U!6$  % 0t $  %b $  x A q 7  $ h uf $ C x s  rtr B B $ C px f x B q q rtr ) x A %$ 7   A Ttr 9$  %b q 7   A   g B $ ( B e $ A ( ( o5pB A ( oB WXpy$ ( B 6 ( ( o5pB nB ( mA   B A  h $ A  A  x A $ A  B A  h $ A  B  $   e B 3 A f i l5v 7 $ h t k k $ 5uh   i j8v 7 $ h z   t '$ ) z   tr V#9 3 $ h 7 tr $ 6uh z   tr 6u$  3g F2`  $   6 B 3 A 3 $ h ) z r  u 8g 7 $r$ $ h  z 1$  % u wr t 3  2  ) ur$ h z 'h z  u$  h z i8uh z #$  A t  3 'h z ur$ h z  ) ) ) uh 3 h  $ h uf z 8'$ ) z F$  % tr $ hr$ z $  % t wr t  B  7  )  x xW e7  A 
g

Now considering in more detail the four values , , , and several cases can be distinguished for the curve which describes if and are linearly independent. If and are linearly dependent, at most three parallel hyperplanes separate both regions. Case 1: All values are or all values are . Then the activation of the network is positive for every input or negative for every input. In particular, there exists at least one point in which is not classied correctly. Case 2: One value is , the others are . since . We may have to We can assume that change the sign of the weights and the thresholds beforehand without affecting 16
g

b/|b|

a/|a|

x1

Fig. 1. Classification in the second case with respect to the two relevant dimensions in the general setting, i.e. a and b are linearly independent. The positive region is convex, as can be shown by considering the vector which is orthogonal to the manifold.
w

the activation due to the symmetries of the sigmoidal activation function: if is positive, we substitute by , by , by , and by ; if is positive, we substitute by , by , by , and by ; if is positive we compose the above two changes. If and are linearly dependent, at most three parallel hyperplanes with normal vector separate the positive and negative region. The number of hyperplanes is determined by the number of points for which yields . If then the above function is strictly monotonically decreasing with , hence at most one point can be observed and the positive region is separated from the negative region by one hyperplane. If , we nd and hence the function can have at most two points with value . The positive region is empty or convex and separated from the negative region by two parallel hyperplanes with normal vector . Assume that and are linearly independent. We nd , and . Dividing by we obtain , , and . The curve describing looks like depicted in Fig. 1, in particular, the positive region is convex, as can be shown as follows: The normal vector decomposes into a combination where and and as above. and are in Case 2 both negative because and are both negative. We now show the convexity. First we show that the curve describing is convex. Assume for contradiction that it was not convex. Then there would exist at least two points on the curve with identical normal vector , identical coefcients and because and are linearly independent, i.e. identical . Consequently, there would exist at least one point with . Note that is obviously differentiable, though is not, since the ratio is a constant . One can compute where . If was
( $  2 q $ Yh  6 ~ ~ I7 $ ( $  A $ x x f $  ( r$  t ) 7$  2  $ $ x ( u$  % ( x P x $ P ux t ( ) ) 78$  % q $ Yh  6 ~ ~ f $ ( ) ~ 3g FgY  ~ $  t h h $ f  f $ $ f  7 g'$  t h $ C x B  AV$ r$  %  A  A1p f q tr ) u t ( ) ( 7 $  2 ~ i$  A1$  % q t $  2  ~  t h 7 B T$ ~ x A  2 $  %  ~ tb x F 5x z x ) z x ) x z x  z 9 c
C v

17

 t )

x $ p% tr

f Y z

z 9 7 x $ C x C x $ z x z t  i u ) pA t 4| { d ) x $ C x 8p}  i u C x $ | { z  F ) p% u 4$%d

A x

B y

f Y

z x ) x z

f 7

~ Y

h 

A 8pr$ f   h t P $  % x f t )  h  2b Vni$  %b X$  7 ( h  ~

~ Y

$ h uf

 % 0t ) q  '$  % 

2 q $

~ Y

x $ pC
g

h 

~ e

a/|a|

a/|a|

x1

Fig. 2. Classification in the third case with respect to the two relevant dimensions in the general setting. The normal vector approaches a direction parallel to a in the limits.
f g7 
A F A

where the term the square root is taken from is negative except for or because , , and are negative. Hence the curve is convex unless or equals which cannot take place by assumption. Hence the positive region or the negative region is convex. We show that the negative region is not convex, hence the positive region is necessarily convex: since and are linearly independent, we can assume w.l.o.g. that the second component of does not vanish. (For other nonvanishing components the argumentation is analogous.) It holds , , and . Choose such that . Consider the line . Thereby only the rst two coefcients are nonvanishing. We nd for the point where a positive activation of the network by assumption on , whereas yields the activation of the network and yields . We nally compute the limits of if approaches the borders of the interval as dened above. Because , we nd . We can compute using the above parameterization of : and . Hence in this case the positive region is convex with normal vector approaching a parallel vector to and , respectively. For every point on the curve the normal vector is parallel to a convex combination of and . Moreover, the coefcients and dened in are strictly monotonic functions. Note that a convex manifold coincides with the intersection of its tangential hyperplanes. Case 3: Two values are , two values are . We have already seen that if and are linearly dependent, at most three parallel hyperplanes separate the positive and negative regions. Assume and are linearly independent. We can assume that the nonnegative values are and . We may have to change the role of the two hidden neurons and or the signs of the weights and the thresholds beforehand: if are nonnegative then we substitute by , by , by , and by 18
 z XfQi3ry Q f 7
~

f 7 c

$ $ f x riUr

f xx ) g

f h h '$

B h B Vm1y

A j

( $tPx x `r ( ) x ( iUr$ $ f x  

( ( g58$533"3 $  fite"53 crf6$  F F h F f  3 f  x $ 3 3 F h $ F 6 u h f x $ x $ 6 u h Q ) f (

7 $  %|  tb

f 

i j

B y

x f X

hui h f 

z d

x f 8

A x

tb f x x`

A h A V6x

$ P tx

x f X

x  `

6w p z 7$  %t b f6 j w uo t b E$uh tr "F2 i 3 f 7 3g 3g F

) $ x f 6

7 &$  A tr

z x )

$ h h un

7 W
~

or

u na# u

and are nonnegative then we substitute and by and , respectively, by , by , by , by and by ; if and , or and are nonnegative we change the role of the two hidden neurons and end up with one of the rst two situations. Note that and cannot both be nonnegative in this situation unless all four values equal which is Case . Moreover, we can assume that at least one value is nonzero, otherwise Case would take place. We can as before consider a normal vector of the curve describing , . The limits of the parameter yield for the value (or if ), and (or if ). For the limits are (or if ), and (or if ). One can compute for all limits with nite bounds and the normal vector . For the limits where or equals we can consider the fraction of the coefcients of the linear combination of and obtain for , the same vector is obtained for and . We obtain if and the same vector is obtained for and . For values in between we can as before consider the fraction of the parameterization of a normal vector and its derivative to estimate the overall geometric form. We obtain a formula analogous to for possible points zero for the derivative of this function: . This is constant if and or in which cases the normal vector equals or , respectively. Otherwise, we obtain at most one possible solution if . If we obtain
7 z x ) 7 x z x ) $6uh 7 z   t g b g ( ~ Yh  ~ iw 7 z T x | z 6w d B x A h $ B A  7  tb z 7 z x ) VDV`x$  2&| 7 z x x d 7 z ) w 7 $  b z tb  ( 2ts| $  % c$  2  ~ ~ h i w w A h V6Axg7$  2t b zR$ 7  2  tb z 97 x z i x $ h '$ z r tr g  7 ) i $6u$ h   7 z r  t ) h  ) I 7 w

which has at most one solution because is positive by assumption hence only the solution with might yield positive values. Hence the separating curve equals either a simple line, or it has an S-shaped form with limiting normal vectors or, in the above special cases, or , respectively (see Fig. 2). We will use the following property in the latter case: the term cannot be limited from above or below, respectively, for points in if approaches the borders or , respectively. I.e. for , the term becomes arbitrary large or arbitrary small, respectively, and an analogous result holds for . Thereby, this fact follows from the above limits because and are linearly independent. Hence a unit tangential vector of the curve which is perpendicular to the normal vector can be decomposed into where is some vector perpendicular to , and , and the coefcient approaches a xed and nonvanishing limit if approaches or , respectively, due to the limits of . Case 4: values are , one value is . This case is dual to Case . We ob19
 B x A h $ B A do144j
B

5F

x 7 P z Pxx ) T z x ) xT AwV$Bx B x h x A oVrex B x A h $ B A 7 97 z dx P x 7 z x ) z ) ( ( 7 9$ ( Tx  z uPTx ) $  A ( tr $ z PTx ) z P@$  % tr x z ) z x $ h h un

$ z

e
g

B y

$ z x ) x  U$ z

$ z

x ` (

x  `$ z

A j

 z

F x 5t g  F x 5 B

 z

x x

z x ) x d f ( f
B

$ z x x  Ur ) $ z x  z )

B x A h $ B x A sd#Vr4`x A h A oVj

x B 

iw | z d B x A h $ B x A RdiVrR`x

x `

g $6uh z   tr

x )
g

7 $ &p% t

3g F2

x `

; if
)

i x

F 5x

Trivial cases

Case 2

Case 3

Case 4

Fig. 3. The different geometric cases which can occur for the form of the separating manifold of the positive and negative region.

tain the possible geometric forms by changing the output sign, i.e. the positive and negative region. Hence the negative region is convex and separated from the positive region by up to two lines or a convex curve. The normal vector of the curve can be written as a linear combination of and : where and are strictly monotonic. If approaches or , respectively, becomes parallel to or , respectively. To summarize, the classication can have one of the forms depicted in Fig. 3. (3) Only Case solves : Next we show that for any classication only in Case all the special points be classied correctly. Obviously, Case is not possible.
P f  P
B G t

can

In order to exclude Case consider the last two dimensions of special points we have constructed. The following scenario occurs (we drop the rst coefcients which are for clarity): the points , , , are mapped to and the points , , , are mapped to (see Fig. 4). They cannot be separated by at most three parallel lines. Assume for contradiction that at most three parallel lines separated the points. Then one line had to separate at least two pairs of points , or , or , or , . Since the points with second component are contained in a single line, we can assume w.l.o.g. that the line separates the second and third pair, the argumentation for the other situations is equivalent. Hence we can limit the tangent vector of the line to be contained in the sector and . Hence each of the remaining at most two lines which are parallel can only separate one of the pairs , or , or , , contradiction.
c

0.5 1c 1.5 1.5 1+c

Fig. 4. Classification problem contained in the training set, due to the special points, when restricted to the last two dimensions.

20

$ P 3 P tQ5yQr

$ P 3 P tWy5yQ

$  2

$ 3 f $ 3 x f $ P 3 P f $ P 3 P f 5# "#p tWy5yQt uW"yWu $ 3 $ 3 $ P 3 P $ P 3 P 5iw 5#r tQ5yQ tQ5yQw x xU

&

x A $  2 

$ P 3 P f uW"yWu

~ g

7 Tt b

$ 3 5iw

$ 3 f "# $ P 3 P uW"yWy

$ P 3 P f uW"yWu

$ P 3 P x uWX5yQ

$ 3 5iw

$ 3 f "#

$ 3 x f 5ix

$ P 3 P f tQ"yWugX

$ 3 5i

P Q $ 3 5i

$ P 3 P f tQ5yQt

$ P 3 P f tWy5yQt

lines perpendicular to b

airrelevant arelevant b a

Fig. 5. Outside the a-relevant region, we can disregard the specific form of the corresponding contribution and substitute it by a constant. Then we obtain separating lines in the last two dimensions.

Hence, and are linearly independent which means that the separating manifold has an S-shaped form. Note that we will only use properties on the size of for points on the curve where or , respectively, in order to derive a contradiction. Dene and . The set of points and are called the - or -relevant region, respectively. Outside, or , respectively, can be substituted by a constant, the difference of the output activation being at most . More precisely, every constant or , respectively, for a xed with or , respectively, or or , respectively, will do. Note that due to the monotonicity of both components of the map for the curve will stay outside the -relevant respectively -relevant region if the curve has crossed the boundary towards the respective region. Now it is rst shown that three points forming an isosceles triangle with height at least and base length at least are contained in the -relevant region. This leads to a bound for the absolute value of . Second, if three points forming an analogous triangle are contained in the -relevant region the same argument leads to a bound for the absolute value of . Using these bounds it can be seen that neighboring points cannot be classied differently. Third, if no such triangle is contained in the -relevant region, the part does not contribute to the classication of neighboring points outside the -relevant region and the two points and or the two points and , respectively, cannot be classied differently. First step: Since the points with second component cannot be separated by one hyperplane, one point with exists inside the - and -relevant region, respectively. Assume the points and were both outside the -relevant region. Then they were contained either on different sides i.e. for one of the points and for the other point or they were both contained at the same side, e.g. for both points. (The case where holds for both points is analogous.) We rst consider the latter case (see Fig. 5): as already mentioned, we could then substitute the respective parts by the value obtained for any further point of the curve with 21
h w
A A

5Fx

$ C x p  t B

$pCx`  tr B $ 'xU e t A x  x A   s5F44 B#  4x4 # $ $ P h b@p5Xf tr 7 $ $ b P h c@p" 7 1   tr  g 

$ 3 x f "#

3Fg2 

x   h $ x u e u A

$ 3 f 5i

$ 3 5i #QteWtg P f 3 P f

P Q

4 sV x A h h x X A

F x 5u

$ F x 5

B 

3 x U

$ 3 5#w

uP

A 6

$ P 3 tQ52

$ 3 x f 5i

x 

P h t5

h x w

$ x u`

F x pg

f g

A s

s A

h 1

A e

$ 3 5i

, the difference being at most . The term is unlimited either from above or from below if approaches the respective limit of the parameterization or , as shown beforehand. Hence we can nd a point of the curve with such that the corresponding value is larger than for both, and , or it is smaller than the value for both, and . Because yields the activation and the rst part differs at most for , , and , the points and cannot be classied differently with an activation of absolute value larger than . Contradiction. If the two points and were contained in different sides outside the -relevant region then the points and were both not contained in the -relevant region and they were both contained in the same side. Hence we would obtain an analogous contradiction for these latter two points. and is contained inside The same argument shows that one of the -relevant region. Therefore the -relevant region contains an isosceles triangle with height at least and hence a circle with diameter at least . Consequently, , where . Second step: If one of the points and and one of the points and is contained in the -relevant region, we could conclude in the same way for . This leads to the following contradiction: we nd for the points and
$ 3 "#w $ 3 x f 3 3 3  5iy5r7 $ 3 x f 5i ( ( c  t   m  f   on B B x A A ) ( z 6$pfx    C B $ 6'x  e A tr ) tr A x $ C pfx   f t B x $ ) 5ux   e u ( z  $ 3 3 3 3  5i55y7   pv d v i$ F G3 F  i 7 F $ 3 5#r
B

This excludes Case . Next we exclude Case . The classication includes in the dimensions to the situation depicted in Fig. 6. At most two parallel planes cannot separate a convex negative region. To show this claim, assume for contradiction that at most two parallel planes classied the points correctly such that the negative region is convex. The points are obviously not linearly separable. Hence one plane has to separate the point (we drop all but dimensions to for clarity) , , , and , one plane separates and all points from the four negative points, one plane separates . Assume a normal vector 22
$ 3 f 3 f 5tu f x 9dn x %U f x %dn $if3tfy 3 $ f 3 f 3 f tt $ f 3 3 i" $ 3 f 3 $ 3 3 f 5u 5y5t $ 3 3 "5r x #U

$ C p  u B $ 3 f 5iW8 $ 3 5iw

Third step: If both points and or both points and are outside the -relevant region, the difference of the values corresponding is at most . The same contradiction results.

Fig. 6. Classification problem; projection of the classification to the plane which is described by the weight vectors of the two hidden neurons; at least one negative point is not classified correctly.

of the respective plane is given by the coefficients . For the plane separating we find (because of ), (because of ), (because of ). For the plane separating we find , , ; for the plane separating we find , , . Hence we would need three different and non-parallel planes. Contradiction. Consequently, and are linearly independent. The negative points are contained in a convex region. Because the negative region can then be obtained as the intersection of all tangential hyperplanes of the separating manifold, each positive point is separated from all negative points by at least one tangential hyperplane of the separating manifold. Consider the projection to the plane spanned by and which determines the classification of the points. Following the convex curve which describes , i.e. for increasing parameter of the parameterization, the signs of the coefficients of a normal vector can change at most once because, with strictly monotonic and , is negative and increasing, and is negative and decreasing. The limits of for or , respectively, are and , respectively. In particular, each component of can change its sign at most once. But a normal vector of a hyperplane which separates one of the positive points necessarily has the signs for , for , and for in the dimensions to ; these signs can be computed as the necessary signs for a normal vector of a plane which separates the respective positive point from all negative points, as in the linearly dependent case. Independent of the order in which the points are visited as increases, at least one component of changes its sign twice, a contradiction. Hence any classification which maps correctly is of the type as in Case .

(4) Property (i): Next we show that the correspondence demanded in property (i) of Theorem 6 holds. An optimum solution of the MAX- -cut problem leads to the following approximate solution of the loading problem with weights , , , , , , where is a positive constant and
… if the node is in the first cut, … otherwise,

Fig. 7. Classification corresponding to a bichromatic edge in the two relevant dimensions for linearly independent and (solid line) and linearly dependent and (dashed line), respectively.

and … if the node is in the second cut, … otherwise.

For appropriate , this misclassifies at most the points for which the edge is monochromatic. Note that we can obtain every accuracy with appropriate , which is independent of the particular instance. Conversely, assume that an optimum solution of the loading problem is given. This classifies correctly. Consequently, we are in Case . We can easily compute an equivalent solution with by performing the weight changes as described in Case . Define a solution of the MAX- -cut problem such that exactly those nodes with are in the first cut. Any edge such that is classified correctly is bichromatic because in dimensions and the following classification can be found: and are mapped to , and are mapped to (see Fig. 7). (The above are dimensions and , the others are and hence dropped for clarity.) If and are linearly dependent then the positive region is separated from the negative region by at most two parallel lines (in dimensions and ) with normal vector orthogonal to ; the positive region is convex. A normal vector of the two lines which bound the positive region in dimensions and , respectively, has obviously two different signs: for and for . Consequently, and have different signs. In the linearly independent case, i.e. when the positive region is separated by a convex curve with normal vector contained in the sector spanned by and , we can find two tangential hyperplanes of the separating manifold such that these hyperplanes each separate one of the negative points. In the dimensions and , which are the only relevant dimensions for this case because all other dimensions equal for the considered points, we then find tangential lines which separate the respective negative point from and , respectively. The or component, respectively, of a normal vector of such a separating line is necessarily positive in order to separate or from , respectively, but the signs cannot be equal because of the classification of . Furthermore, each sign of a component of a normal vector can change at most once if we follow the convex curve for increasing ; the limit for equals . Consequently, the signs of and have to be different.

(5) Property (ii): Finally we establish property (ii) of Theorem 6: given a set of weights forming a solution for the loading problem corresponding to an architecture, it is possible to compute a cut as demanded in property (ii). Because is classified correctly, we can assume that the classification is of Case and, without loss of generality, we can assume again that (note that the weight changes in Case can be computed in polynomial time). As before, we can define a solution of the instance of the MAX- -cut problem via the sign of , i.e., the nodes with positive are in the first cut (again, note that this can be computed in polynomial time). If is monochromatic and and are correctly classified, then is not classified correctly, which is shown using the same argument as before.
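The cut extracted from a trained network in properties (i) and (ii) depends only on the signs of certain weights. The following sketch illustrates this bookkeeping in isolation; the function names and the reduction of an edge check to a plain sign comparison are ours and only mirror the argument above, not the full construction.

def cut_from_signs(weights):
    # Node i is placed in the first cut iff its associated weight is positive,
    # mirroring the sign argument used in properties (i) and (ii).
    return {i: (1 if w > 0 else 2) for i, w in enumerate(weights)}

def edge_is_bichromatic(cut, i, j):
    # A correctly classified edge point forces the two associated weights to
    # carry different signs, i.e. the edge ends up bichromatic.
    return cut[i] != cut[j]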

Unfortunately, the above situation is restricted to the case of -separation, which seems realistic for some applications but is nonetheless a modification of the original problem. However, this restriction offers the possibility of transferring the hardness result to networks with activation functions which are similar to the standard sigmoid.

Corollary 11 It is NP-hard to approximate the loading problem with relative error smaller than for the architecture of a -net with activation function in the hidden layer, activation function H in the output with ( ), weight restriction of the output weights ( ), and examples from , provided that is -approximate of .

PROOF. The proof goes via L-reduction from the MAX- -cut problem and Theorem 6. Since it is almost identical to the proof of Theorem 9, except that sigmoidal networks are substituted by -networks, which is possible because is -approximate of , we only sketch the identical parts and show the modifications due to the new activation function . Assume that we are given an instance of the MAX- -cut problem. One can reduce this instance of the MAX- -cut problem to an instance of the loading problem for the sigmoidal network with weight restriction and minimum accuracy . This training set can be loaded with a network with activation function and every accuracy of value less than such that only the points corresponding to monochromatic edges are misclassified. We substitute the function by in this network. Since the weights of the output neuron are at most , the output activation is changed by at most . Hence the -network classifies all points but the points corresponding to monochromatic edges correctly with accuracy .

Definition 10 Two functions are -approximates of each other if their values differ by at most for all arguments.

Conversely, any solution of this loading problem with a network with activation function , accuracy , and weight restriction leads to a solution for a network with activation function with accuracy of the same quality. This is due to the fact that and differ by at most and hence the output activation is changed by at most . Considering the signs of the weights in this sigmoidal network, we can construct a cut in the same way as in the proof of Theorem 9. At most the edges corresponding to misclassified points are monochromatic, and the same holds for the -network, too. Hence property (i) of Theorem 6 holds. Conversely, given a solution of the loading problem for the activation function with accuracy , we first substitute the activation by , obtaining a solution for of the same quality with accuracy . A computation of a cut in the same way as in the sigmoidal case leads to a cut where every misclassified point for comes from either a misclassification of , , or . Hence this point was misclassified by the network with activation as well. Hence property (ii) of Theorem 6 follows. Since the factors concerning the L-reduction are the same as in Theorem 9, we obtain the same approximation bound.
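The quantitative step used twice in this proof, namely that swapping the activation for an ε-approximate one moves the output activation by at most ε times the (restricted) output weights, can be checked numerically. The following sketch is only an illustration: the perturbed activation, the weight sizes, and all names are our own assumptions, not the construction of the proof.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eps_approximate(x, eps):
    # Any activation within eps of the sigmoid in the supremum norm will do;
    # the concrete perturbation below is purely illustrative.
    return sigmoid(x) + eps * np.sin(7.0 * x)

def output_activation(act, x, W, b, v, c):
    # Activation fed into the output threshold of a single-hidden-layer net.
    return float(v @ act(W @ x + b) + c)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 4)), rng.normal(size=2)   # two hidden neurons
v, c = rng.normal(size=2), 0.1                       # output weights, threshold
eps = 0.05
for _ in range(200):
    x = rng.normal(size=4)
    gap = abs(output_activation(sigmoid, x, W, b, v, c)
              - output_activation(lambda z: eps_approximate(z, eps), x, W, b, v, c))
    # Swapping in an eps-approximate activation moves the output activation
    # by at most eps times the l1-norm of the output weights.
    assert gap <= eps * np.abs(v).sum() + 1e-9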

3.2.2 The -lin-H-net

In this section, we consider the approximability of the loading problem with the semilinear activation function lin which is commonly used in the neural net literature [6,11,14,22]. This activation function is defined piecewise: it is linear on a bounded interval and constant outside of it; a common form is sketched below.
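A minimal sketch of the semilinear activation in the form that is standard in the literature cited above; the exact breakpoints and saturation values of the authors' definition may differ, so the constants here are an assumption for illustration only.

def semilinear(x):
    # Piecewise-linear ("semilinear") activation: identity on [0, 1],
    # saturated at 0 below and at 1 above (illustrative breakpoints).
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x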

It is continuous and captures the linearity of the sigmoidal activation at the origin as well as the asymptotic behavior of the sigmoid for large values. The following result is of interest since it is not necessary to restrict the output activation to the situation of -separation now.

Theorem 12 It is NP-hard to approximate the loading problem with relative error smaller than for -architectures with the semilinear activation function in the hidden layer, the threshold activation function in the output, and examples from .

Hence we have generalized the approximation results to more realistic activation functions. The proof, which is similar to Theorem 9, can be found in the appendix.

3.3 Avoiding Multiplicities

In the reductions of the previous sections, examples with multiplicities were contained in the training sets. One may ask the question as to whether this is avoidable, since the training set in some learning situations may not contain the same pattern many times. Consider the following modification of the reduction of the MAX- -cut problem to a loading problem: it yields the following mutually different points: points for each node , points for each edge (where is the degree of ), and points and forming the set . Assume that the algorithms and satisfy the following properties:

(i) For an optimum solution of the MAX- -cut problem we can find an optimum solution of the instance of the corresponding loading problem in which the special points and all points are correctly classified and exactly the monochromatic edges lead to misclassified points or . (ii) For any approximate solution of the instance of the loading problem which classifies, for each , at least one point in correctly, we can use the algorithm to compute, in polynomial time, an approximate solution of the instance of the MAX- -cut problem with the following property: for any monochromatic edge in this solution, either or or for all or for all are misclassified.

Theorem 13 Under the assumptions stated above the reduction is an L-reduction with constants , , and , where .

Corollary 14 The reductions in Theorems 8, 9, and 12 can be modified such that both (i) and (ii) hold.

Hence minimizing the relative error within some constant (smaller compared to those in Theorems 8, 9, and 12) is NP-hard even for training sets where no example is repeated more than once. The proofs of Theorem 13 and Corollary 14 can be found in the appendix.

3.4 Correlated Architecture and Training Set Size

The reductions in the previous sections deal with situations where the number of examples is much larger than the number of hidden neurons. This may be unrealistic in some practical applications where one would allow larger architectures if a large amount of data is to be trained.


One reasonable strategy would be to choose the architecture such that valid generalization can be expected using the well known bounds in the agnostic or PAC setting [34]. Naturally the question arises about what happens to the complexity of training in these situations. One extreme position would be to allow the number of training examples to be at most equal to the number of hidden neurons. Although this may not yield valid generalization, the decision version of the loading problem becomes trivial because of [32]:

Observation 15 If the number of neurons in the first hidden layer is at least equal to the number of training examples and the activation function is the threshold function, the standard sigmoidal function, or the semilinear function (or any activation function such that the class of -networks possesses the universal approximation capability as defined in [32]), then the error of an optimum solution of the loading problem is determined by the number of contradictory points in the training set (i.e., points and with ).

The following theorem yields an NP-hardness result even if the number of examples and hidden neurons are correlated.

Theorem 16 Approximation of the success ratio function with relative error smaller than ( is a constant, is the number of hidden neurons) is NP-hard for the loading problem with instances , where is a -H-architecture ( and may vary) and is an example set with .

PROOF. The proof is via a modification of an L-reduction from the MAX- -cut problem. Assume that, in Definition 4 of an L-reduction, the algorithm , given an instance , produces in polynomial time an instance of and a parameter , which may depend on the instance , such that the maxima opt and opt , respectively, satisfy opt opt , and the algorithm maps in polynomial time a solution of the instance of cost with relative error at most to a solution of the instance of cost such that the costs and satisfy opt opt . Notice that need not be a constant. Then, assuming that the problem is NP-hard to approximate within relative error , we can conclude immediately that it is NP-hard to find an approximate solution of instances of problem with relative error smaller than . We term this modified reduction a generalized L-reduction. The algorithms and , respectively, will be defined in two steps: mapping an instance of the MAX- -cut problem to an instance of the MAX- -cut problem with an appropriate and then to an instance of the loading problem as in Theorem 6, afterwards; or mapping a solution of the loading problem to a solution of the MAX- -cut problem and then to a solution of the MAX- -cut problem, afterwards, respectively.
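For orientation, the two conditions of an L-reduction in the usual notation, writing $f$ for the instance map, $g$ for the solution map, $c, c'$ for the costs, and $\alpha, \beta$ for the parameters (these symbol names are ours, not the paper's):

\[
\mathrm{opt}(f(I)) \le \alpha \cdot \mathrm{opt}(I), \qquad
\bigl|\mathrm{opt}(I) - c(g(s))\bigr| \le \beta \cdot \bigl|\mathrm{opt}(f(I)) - c'(s)\bigr| .
\]

In the generalized L-reduction used above, $\beta$ may depend on the instance $I$ instead of being a constant.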

Fig. 8. Construction of and : The points result by dividing each point on the lines into a pair. Each group, indicated in the picture by and , is to be separated by an additional line.

: given a graph , define (w.l.o.g. assume that ) with , for , where the new edges in have the multiplicity (i.e., copies of each new edge are contained in ). Reduce to an instance of the loading problem for the -H-architecture with , and the following examples:

We first define and .

Note that the number of points equals , which is at most and at least for large enough . An optimum solution of the instance of the MAX- -cut problem gives rise to a solution of the instance of the MAX- -cut problem with the same number of monochromatic edges via mapping to the same three cuts as before and defining the cut by the nodes in for . This solution can be used to define a solution of

(1) copies of the origin , (2) copies of the point (where the is at the position from left) for each , being the degree of , (3) a vector for each edge in (where the two ones are at the and positions from left), (4) copies of each of the points , , where and are constructed as follows: define the points for and . These points have the property that if three of them lie on one line then we can find an such that the three points coincide with , , and . Now we divide each point into a pair and of points which are obtained by a slight shift of in a direction that is orthogonal to the line (see Fig. 8). More precisely, and , where is the normalized normal vector of the line and is a small value which can be chosen in such a way that the following holds: , , , , and . Assume one line separates three pairs, say , , ; then the three pairs necessarily correspond to the three points on one line, which means . Using Proposition of [26] it is sufficient to choose . Hence the representation of and is polynomial in and .

the instance of the loading problem as follows: for neuron in the hidden layer: the weight, for , is chosen as
… if the node is in the corresponding cut, … otherwise;

The threshold is chosen as , and the , , and weights are chosen as , which corresponds to the line through the points , , and . The output unit has the threshold and all weights , i.e. it computes the function AND of its inputs .
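One standard way to realize AND of binary inputs with a single threshold unit, under the assumed convention that the threshold activation outputs 1 iff its argument is non-negative; the concrete weight and threshold values of the paper's construction may differ.

def heaviside(z):
    # Assumed convention: outputs 1 iff z >= 0.
    return 1 if z >= 0 else 0

def and_unit(bits):
    # AND of k binary inputs: all weights 1 and threshold -k, so the
    # activation is non-negative exactly when every input equals 1.
    k = len(bits)
    return heaviside(sum(bits) - k)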

With these choices of weights it is easily seen that all examples except the points corresponding to monochromatic edges are classified correctly. Conversely, an optimum solution of the loading problem classifies all points in 1, 2, and 4 and all points corresponding to edges in correctly because of the multiplicities of the respective points. As in the proof of Theorem 8 we can assume that the activations of the neurons do not exactly coincide with when the outputs on are computed. Consider the mapping which is defined by the network on the plane

The points and are contained in this plane. Because of the different outputs, each pair is to be separated by at least one line defined by the hidden neurons. Hence the lines nearly coincide with the line through , . Denote the weights and the threshold of the output neuron of the network by and , respectively. We can assume that the neuron maps to and to for all , since otherwise we change all signs of the weights and the threshold in neuron , we change the sign of the weight , and increase by to satisfy this condition. Hence , for all , and therefore for all with . This means that the output unit computes the function NAND on binary values of .
Define a solution of the instance of the MAX- -cut problem by setting the cut as those nodes which the hidden neuron maps to . Because of the classification of the points , all nodes are contained in some cut. Assume some edge is monochromatic. Then and are mapped to by the same hidden neuron, hence the vector is mapped to also, because of the classification of the origin. Hence is classified incorrectly. All points corresponding to

edges in are classified correctly, hence each of the nodes forms one cut and the remaining nodes are contained in the remaining three cuts. These three cuts define a solution of the instance of MAX- -cut such that all edges corresponding to misclassified s are monochromatic. Denote by opt the value of an optimum solution of the MAX- -cut problem and by opt the optimum value of the loading problem. We have shown that

Next we construct . Assume that a solution of the loading problem with relative error smaller than is given. Then the points 1 and 4 are correct due to their multiplicities. Otherwise the relative error of the problem would be at least for appropriately small and large . As before we can assume that the output neuron computes the function NAND . Define opt to be the value of an optimum solution of the loading problem and the value of the given solution. Assume that some point corresponding to an edge in is misclassified. Then yields an arbitrary solution of the MAX- -cut problem. For the quality of this solution compared to an optimum opt we have
( ( $  f  

opt

opt

This holds because an optimum solution of the loading problem classifies correctly at least more points than the solution considered here. If all points corresponding to edges in are classified correctly, then we define a solution of the MAX- -cut problem via the activation of the hidden neurons as discussed above. Remaining nodes become members of the first cut. An argument similar to the above shows that each monochromatic edge comes from a misclassification of either , , or . Hence
( ( $    t q d  n

opt

opt

31

h $ a `nVrp

P f x n1i

With , using Theorem 3, our result follows.


q

for appropriate constant , and

 n1

x t

Ve P

a 1eXt t a

The quantity which is computed by the algorithm equals positive constant which can be chosen appropriately such that .
 ( h $ a dn1pn
q

a n

P f x n1i

a h yu

Vi P f x

t q

P n1e

`n

 p

$ a pn

opt
(

a n

P f x nV

x $ n`ni

P f x n1i

a h u

n1P

a t y#7

 n59x 

opt

opt

 np

VP

 t

P f x nVi

VP

P  h nV6p

P h 7 tt

td

( n
$

, being a xed

Vif P

 6

In the above theorem, the number of points is upper bounded by a term involving the number of hidden neurons. Since the approximation factor depends on the number of hidden neurons, we added the lower bound which excludes situations where every point is to be classified correctly due to the bound of the approximation ratio and the size of the training set.

4 Hardness of Approximating the Failure Ratio Function

In the remaining part of this paper we consider another objective function, the objective of minimizing the failure ratio. We use the notations introduced in Section 2. Given an instance of the loading problem, denote by the number of points in the training set (counted with multiplicities) misclassified by a network . Given a constant , we want to find weights such that opt opt , where denotes the network with our weights. Notice that if opt , this is equivalent to investigating whether the failure ratio can be bounded above by a constant. Hence this problem is referred to as the problem of approximating the minimum failure ratio within a constant while learning in the presence of errors [2]. If restricted to situations where a solution without errors exists, this only yields the original loading problem, since no errors are allowed in the approximation either. Hence we restrict ourselves to situations where no solution without misclassified points exists.
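In symbols (ours, for orientation): writing $m(w)$ for the number of misclassified training points under weights $w$ and $\mathrm{opt} = \min_{w'} m(w')$, the task is to find $w$ with

\[
m(w) \le c \cdot \mathrm{opt},
\]

i.e., for $\mathrm{opt} > 0$, to bound the failure ratio $m(w)/\mathrm{opt}$ by the constant $c$.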

4.1 Approximation within constant factors

We want to show NP-hardness of approximation of within some bound by a layered H-net. It turns out that the bound on approximation of for which we can prove NP-hardness is a constant independent of the number of neurons of the network architecture. For our purpose we use a reduction from the set-covering problem.

Definition 17 (Set Covering Problem) [15] Given a set of points , ..., and a set of subsets of , find indices such that . If such a set of indices exists, then the sets , are called a cover of (or, said to cover ). A cover is called exact if the sets in the cover are mutually disjoint. The goal in the optimization version of the set covering problem is to find a set of indices for a cover with being the minimum possible.

Definition 18 (Satisfiability Problem) [15] Given a Boolean formula , in conjunctive normal form, over a set of variables , find a truth assignment which satisfies the formula (i.e. makes the value of the formula true).
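A small sketch of the two notions in Definition 17, for concreteness; the data representation (sets of integers indexed by a list) is our own choice.

def is_cover(universe, subsets, indices):
    # The chosen subsets cover the universe iff their union equals it.
    covered = set()
    for i in indices:
        covered |= subsets[i]
    return covered == set(universe)

def is_exact_cover(universe, subsets, indices):
    # An exact cover additionally requires the chosen subsets to be mutually
    # disjoint, i.e. their sizes add up to the size of the universe.
    return (is_cover(universe, subsets, indices)
            and sum(len(subsets[i]) for i in indices) == len(set(universe)))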


Both the satisfiability problem (or SAT problem for short) and the set-covering problem are known to be NP-hard [15]. For the set-covering problem the following result also holds, showing that it is NP-hard to approximate this problem within every constant factor .

Theorem 19 [7] For every there is a polynomial time reduction that, given an instance of SAT, produces an instance of the set-covering problem and a number with the properties: if is satisfiable then there exists an exact cover of size , but if is not satisfiable then every cover has size at least .

Using Theorem 19, Arora et al. [2] show that approximating the minimum failure ratio is NP-hard for the simple perceptron model, i.e. nets with threshold activation function, for every constant if the threshold of the output neuron is zero. We can obtain a similar result for arbitrary layered H-nets where the thresholds of the neurons in the first hidden layer are fixed to .

Theorem 20 Assume that we are given a layered H-net where the thresholds of the neurons in the first hidden layer are fixed to , the number of neurons in the first hidden layer is fixed, and the input dimension varies. Let be any given constant. Then the problem of approximating the minimum failure ratio for such an architecture while learning in the presence of errors within a factor is NP-hard.

PROOF. The case without any neurons in the hidden layers is already proved in [2], hence we assume that at least one hidden layer is present. Assume that we are given a formula of SAT. First, we transform this formula in polynomial time (with the given constant ) to an instance of the set-covering problem and a constant such that the properties in Theorem 19 hold. Next, we transform this instance of the set-covering problem to an instance of the loading problem for the given architecture with input dimension (where denotes the number of neurons in the rst hidden layer) and the following examples from : ), copies of the points e , e , where is the vector with component as if and only if , , (III) copies of the points , , and , where the component is nonzero in all three points and the component is nonzero in the latter two points, (IV) copies of the points , , , , for all vectors , , where , ..., ) are constructed as follows: the points and (for select points in each set 33

(denote the points by , , . . . ) such that any of these points lie on one hyperplane if and only if they are contained in one . Such points exist as shown in [26]. For dene by and by , ..., , , , ..., , for some small value which is chosen such that the following property holds: if one hyperplane in separates at least pairs , these pairs coincide with the pairs corresponding to the points in some , and the separating hyperplane nearly coincides with the hyperplane through . It is shown in [26] that such points exist and an appropriate can be computed depending on the points . For an exact cover of size , let the corresponding set of indices be , ..., . Dene the weights of a threshold network such that the neuron in the rst hidden layer has the weights e e where the component of e is if and only if and e is the unit vector in . Each of the remaining neurons in the other layers computes the function of their inputs . Since the cover is exact, this maps all examples correctly except examples in (I) corresponding to sets in the cover. Conversely, assume that every cover has size at least . Assume, for the sake of contradiction, that there is some weight setting that misclassies less than examples. We can assume that the activation of every neuron is different from on the set of examples: for the examples in (IV) the weight serves as a threshold, for the points in (I), (II), and (III) except for the weight serves as a threshold, hence one can change the respective weight which serves as a threshold without changing the classication of these examples such that the activation becomes nonzero via enlarging the respective weight by , being the maximum negative activation of the neuron. Assuming that the activation of is zero we can increase the weight such that the sign of the activation of all other points which are affected does not change. The precise value can be computed in polynomial time depending on the other weights and activations of the points. Because of the multiplicity of the examples we can assume that the examples in (II)-(IV) are correctly classied. We can assume that the network function has the form where is the function computed by the neuron in the rst hidden layer because of the points in (IV). This is due to the fact that the points and enforce the respective weights of the neurons in the rst hidden layer to nearly coincide with weights describing the hyperplane with coefcient zero. Hence the points are mapped to the entire set by the neurons in the rst hidden layer and determine the remainder of the network function. Hence all neurons in the rst hidden layer classify all positive examples except less than points of (I) correctly and there exists one neuron in the rst hidden layer which classies the negative example in (III) correctly as well. Consider this the (vector of) weights of this neuron. Because of the last neuron. Denote by examples in (III), . Dene . 34
wf

Assume that forms a cover. Because of the examples in (III) we have and . Therefore one of the examples in (I) is classied incorrectly for every . This leads to misclassied examples because every cover is of size . This is a contradiction. Otherwise, assume that does not form a cover. Then one can nd for some and the point e in (II) an activation which is because it holds that , (III). This yields a misclassied example with multiplicity . This is again a contradiction. We can now complete the proof of the theorem very easily. Assume, for the sake of contradiction, that we can approximate the minimum failure ratio of the loading problem within a factor of . Then we could transform, in polynomial time, a given instance of the SAT problem to an instance of the set cover problem and then to an instance of the the loading problem as described above with the following property: if is satisable, then the loading problem has a solution with misclassications, if is not satisable, then every solution of the loading problem has at least misclassications.


Since we can approximate the minimum failure ratio within a factor in polynomial time, we can decide, given an approximate solution of the loading problem, whether the loading problem has a solution with misclassifications, or whether alternatively every solution of the loading problem has at least misclassifications. This means that we would then know if is satisfiable or not, thereby solving the SAT problem, an NP-hard problem, in polynomial time. Since an arbitrary constant , which is independent of the architecture, may be used in the above theorem, this theorem suggests that in the presence of errors training may be extremely difficult.

4.2 Approximation within large factors

Assuming that NP DTIME poly , one can show that even obtaining weak approximations, i.e. approximations within some large factor which depends on the input dimension, is not possible. For this purpose a reduction from the so-called label cover problem is used.

A quasi-polynomial time algorithm is an algorithm that runs in DTIME poly time, where is the size of the input and poly is a fixed polynomial in .


Definition 21 (Label Cover) Given a bipartite graph with , labels in sets and , and a set , a labeling of consists of functions and which assign labels to the nodes in the graph. The cost of this labeling is . An edge is covered if both and are not empty and for all there exists some with . A total cover of is a labeling such that each edge is covered. The goal for the optimization version of the label cover problem is to find a total cover with minimum cost.

For the label cover problem the following result holds, showing that it is almost NP-hard to obtain weak approximations.

Theorem 22 [2,23] For every fixed there exists a quasi-polynomial time reduction from the SAT problem to the label cover problem which maps an instance of size to an instance of size with the following properties: If is satisfiable then has a total cover with cost . If is not satisfiable then every total cover has a cost of at least . Furthermore, in both cases satisfies the property that, for each edge and , at most one exists with .
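A sketch of the edge-covering condition of Definition 21, with the edge relation modelled as a set of admissible label pairs; this data layout and the function name are our assumptions, chosen only to make the condition concrete.

def edge_covered(labels_u, labels_v, admissible_pairs):
    # An edge is covered iff both endpoints carry at least one label and
    # every label on the second endpoint is matched by some label on the
    # first endpoint via the edge's relation.
    if not labels_u or not labels_v:
        return False
    return all(any((a, b) in admissible_pairs for a in labels_u)
               for b in labels_v)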

Using this theorem and ideas of Arora et al. [2] we can prove the following theorem.

Theorem 23 Assume that we are given a layered H-net (where is fixed and is the varying input dimension) where the thresholds of all the neurons in the first hidden layer are fixed to , and let be any given constant. If the problem of approximating the minimum failure ratio while learning in the presence of errors for this architecture within a factor can be solved in polynomial time, then NP DTIME .

PROOF. We can assume that since otherwise the result is already proven in [2]. Assume that we are given a formula of the SAT problem of size . We transform with the given constant to an instance of the label cover problem of size with the properties as described in Theorem 22 for this .
refers to the class of problems that can be solved by a deterministic Turing machine in quasi-polynomial time. More information about these and related topics is available in any standard textbook on structural complexity theory, such as [5,15]. We omit any precise definition of the size of an instance of the SAT problem and the size of an instance of the label cover problem, since those will not be necessary.
$ 3 

36

3 y 12

v u sqom t @Fnk P

7 U

@ $ 3 3 5WE& 7 $  d

 7` $ y  yrn

$ 3 F 3 e5ti

p o mkj k P v Fnl Fm !9

Y U

h U

b 2

4 $ 3 Gr

e5ti $ 3 F 3  $ y y2

$ 3 r

$ 3 

pxv

$ y  yrF

3 y 1r

 b

First, following the same approach as in [2], we delete all in such that for some edge incident to no exists with . The remaining labels are called valid labels. The cost of a total cover still remains if is satisable (since the label cover in such a case uses only label for a vertex). Otherwise, if is not satisable, then this deletion can only increase the cost of a total cover. After these deletions, by Theorem 22, for each and there exists a unique such that . We can assume that a total cover exists, since this can be easily checked in polynomial time. Now transform this instance to an instance of the loading problem. The input dimension is where , are the edges, and and are the labels. By the results in [2] (see Theorem 7, Lemma 9 and Theorem 13 of [2]) (with ) Hence . Let and for notational simplicity. The following examples from are constructed: (the rst components are successively identied with the tuples in and and denoted via the corresponding indices.) copies of the points ( ), , , , where the points , , are the same points as in the proof of Theorem 20. (II) copies of , (III) copies of and , (IV) copies of each of e , e , where e is precisely at those places such that is a valid label for and otherwise, and e is precisely at the places such that ( , ). (V) copies of each of e , where e is precisely at those places such that is a valid label for and is not assigned to and at the place and otherwise ( ). (VI) e , where e is precisely at those places such that is a valid label for . We now prove the following two claims: (a) If a total cover with cost exists, then the number of misclassied points in an optimum solution of the loading problem is at most . (b) The number of misclassied points in any optimum solution of the loading problem is at least , the minimum possible cost of a total cover. Assuming both (a) and (b) are true, we can complete the proof of our theorem easily as follows. Given an instance of the SAT problem, we can transform this instance in quasi-polynomial time to an instance of the label cover problem for the given and then to an instance of the loading problem as shown above such that

y F $31yr F f $ f i 2'#5t""  w v 3 f 3 3     y $ y"3  $ F 3 y r y F $ F 3 y 1r f $ w v 53tu a  f f 3 f 3    4y $e53  f y F $ F 3 y GV2 f $i w '5utf6  $i w v'5tt3  f v 3 f 3 3 f 3 f 3 f $t5 w v 5tf3(UupilS T 3 $ R q f h f 3 S $ if  w v "3t"UupilS r f 3 $ ( R q f h f 3 S z z $ f i w '"5t3 S S r v 3 3 f z 

37

$ f  f 3 t@

3 ( r

$ f  f it3

`n

3 ( 

 t  $  f t5t3  f3 ( r

`n

$  f 5t3 

3 ( 

(I)

$ 3 F 3 y"iG$

T$ 3 F "iG3 q i  q 3 y V2 i 7 @ w

`n

@ uf3# v d T@ 7 3 @ 7 l nuTV#R 5 sx b  gb x $ x  T@ `nxpn nfd 5 @ f x  xgb b x P x 7

x @ 7 Tdd

$ 3 F 3 e5t6

$ F 3  7 Wt6V

the following holds: if is satisable, then the loading problem has a solution with misclassications, if is not satisable, then every solution of the loading problem misclassies at least points.
U5p w v v j t @rFnk P v u sq om U `np w v v j t @Fnk P v u sq om U
&

Assume, for the sake of contradiction, that we can approximate the minimum failure ratio of the loading problem within a factor smaller than . Then, we can decide in quasi-polynomial time, given an approximate solution of the loading problem, whether the loading problem has a solution with misclassication, or whether alternatively every solution of the loading problem has at least misclassications. This means that we would then know if is satisable or not, thereby solving the SAT problem, an NP-hard problem, in quasipolynomial time. Since this holds for every and for large , the result as stated in the theorem follows. The remainder of this proof is devoted to proving the claims (a) and (b) above. exists. Dene First, we prove claim (a). Assume that a label cover with costs the weights for the neurons in the rst computation layer by is assigned to , is assigned to , , . The remaining coefcients of the neuron in the rst hidden layer are dened by: , the remaining coefcients are . The neurons in other layers compute the function of their inputs . This maps all points but at most points in (VI) to correct outputs. Note that the points in (V) are correct since each is assigned precisely one . This concludes the proof of (a). Now we prove (b). Assume that a solution of the loading problem is given. We show that it has a number of misclassied points which is at least the cost of an optimum total cover. Assume for the sake of contradiction that less than points are classied incorrectly. Since a cover has a cost at most we can assume that all points with multiplicities are classied correctly. Because of the same reasoning as in Theorem 20 we can assume that the activation of every neuron is different from on the training set. Additionally, we can assume that the output of the circuit has the form where is the function computed by the neuron in the rst hidden layer, because of the points in (I). Hence all neurons in the rst hidden layer classify all positive examples except less than points of (V) correctly and there exists one neuron in the rst hidden layer which classies the negative example in (III) correctly as well.
$(URurph   p $dR'ph  z P h  p  z h j S S  h  z (  $ ( R#Pph7  f z f Y7  p j  F fY7 d
z

Denote by the weights of this neuron. Because of (II), node with those valid labels such that the inequality holds. Label the node with those labels such that
a F

38

$ ( Ruqfphifc y

If this labeling forms a total cover, then we nd for all assigned to activation smaller than . Due to (III),
(
z

p w v v j t @Fnk v u sq om P

. Label the .

in (VI) an

v v t @Fnk P u sq om

j 

 1

 i

h g

$  

x &$ ( trh  R 

w v

$   

f 7

7 $  3  5r

j 

`n5p w v v j t @q Fnk P v u s om

v u sq om t @rFnk P

f 7 

0 1)

Assume otherwise that this labeling does not form a total cover. Then some or is not labeled, or for some label for and edge no is assigned to with . Due to (IV) we nd , hence valid for together with (III) , hence at least one valid for is of size at least . In the same way we nd , hence at least one is of size at least . Consequently, each node is assigned some label. Assume that the node is assigned some such that the edge is not covered. Hence . Due to the points in (V) we nd that the inequality valid for , holds and due to (IV) we nd , hence we can valid for conclude: valid for , valid for , Hence at least one weight corresponding to a label which can be used to cover this edge is of size at least . This concludes the proof of (b).
 $ ( urph  z R   $ R  durph z h $ $   (  (  (   p  p $dR'phi P f h z P j 7 z a j x z  z  z  p (  p  h j ! cd{ o z z h j cd{ op (   p   h z x z j   p  (   p gh z x  z x a j j ! cd{ op W  p  $ R P d'ph  z h j   p  j $ R P d'@ph  z (   p j   xh z x  z a $ R P d'@ph  z    p p j  $ ( uph  z R q f  z h j s p j  x y"iG3 $ 3 F

5 Conclusion and open questions

We have shown the NP-hardness of finding approximate solutions for the loading problem in several different situations. They can be seen as generalizations of the classical result of [9] to more realistic situations. We have considered the question as to whether approximating the relative error within a constant factor is NP-complete. Compared to [4] we considered the -network with the sigmoidal (with -separation) or the semilinear activation function. Furthermore, we discussed how to avoid training using multiple copies of the examples in the NP-hardness results. We also considered the case where the number of examples is correlated to the number of hidden neurons. Investigating the problem of minimizing the failure ratio in the presence of errors yields NP-hardness within every constant factor for multilayer threshold networks (with a fixed number of neurons in the first hidden layer and all thresholds in the first hidden layer fixed to ). Assuming stronger conjectures in complexity theory, we established that even weak approximations cannot be obtained in the same situation. Several problems still remain open in this context, some of which are unsolved even if we ask about the existence of an exact solution instead of an approximate solution: (1) What is the complexity of training multilayer threshold networks if restricted to binary examples? In [9], the NP-completeness for the architecture with binary examples is shown. For more hidden neurons this is unsolved if only one output neuron is present. Some work for multilayered architectures can be found in [28].
$ f 3 P 3 b iy56r f h $ f 3 P 3 b iy62

xh

x 

, hence the activation is smaller than points which is at least .



z

and leads to a number of misclassied

( urpif R s h   z

y r

 y


References
[1] E. Amaldi and V. Kann, The complexity and approximability of finding maximum feasible subsystems of linear relations, Theoretical Computer Science 147(1-2) (1995) 181-210.
[2] S. Arora, L. Babai, J. Stern and Z. Sweedyk, The hardness of approximate optima in lattices, codes and systems of linear equations, Journal of Computer and System Sciences 54 (1997) 317-331.
[3] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi, Complexity and Approximation (Springer, Berlin, New York, 1999).
[4] P. Bartlett and S. Ben-David, Hardness results for neural network approximation problems, Theoretical Computer Science 284(1) (2002) 53-66.
[5] J. L. Balcázar, J. Díaz and J. Gabarró, Structural Complexity I, EATCS Monographs on Theoretical Computer Science (Springer, Berlin, New York, 1988).


(2) What is the complexity of training networks with the sigmoidal activation function in the hidden neurons? [19,35] show some situations to be NP-hard; however, they consider networks which are used for interpolation instead of classification, i.e., the quadratic error is to be minimized. Since classification is an easier task, NP-hardness seems more difficult to prove. (3) For which classes of activation functions can the result for the sigmoidal case still hold? Actually, we only use some properties of the sigmoid, such as the fact that it is continuously differentiable, monotonous, symmetric, bounded, and that in case in the proof the set of points classified positive is convex. (4) What is the complexity of finding an approximate solution if the number of examples is restricted with respect to the number of neurons in the hidden layers? We obtained one result in this context, but only with error bounds which depend on the number of hidden neurons. (5) Can a general argument be found which will show the validity of the NP-hardness results for examples without multiplicities? We used a step by step analysis. (6) What are the characteristics of a set of examples for which loading is NP-hard? It is well known that pairwise orthogonal training examples can be classified correctly even without hidden neurons. Can, for example, an example set with limited correlation of the points, i.e. bounded values for all patterns and some constant , be loaded efficiently? Some investigation concerning this topic can be found in [30]. The authors in [8] show that the situation of [9] changes if the input examples come from a specific (realistic) input distribution; then training is possible in polynomial time.

[6] R. Batruni, A multilayer neural network with piecewise-linear structure and backpropagation learning, IEEE Transactions on Neural Networks 2 (1991) 395403. [7] M. Bellare, S. Goldwasser, C. Lund, and A. Russell, Efcient multi-prover interactive proofs with applications to approximation problems, in: Proceedings of the 25th ACM Symposium on the Theory of Computing (1993) 113131. [8] A.L. Blum and R. Kannan, Learning an intersection of halfspaces over the uniform distribution, in: V. P. Roychowdhury, K.Y. Siu, and A. Orlitsky, eds., Theoretical Advances in Neural Computation and Learning (Kluwer, Dordrecht, The Netherlands, 1994) 337356. [9] A. Blum and R. L. Rivest, Training a 3-node neural network is NP-complete, Neural Networks 5 (1992) 117127. [10] L. Blum, F. Cucker, M. Shub, and S. Smale, Complexity and Real Computation (Springer, Berlin, New York, 1998). [11] J. Brown, M. Garber, and S. Vanable, Articial neural network on a SIMD architecture, in: Proc. 2nd Symposium on the Frontier of Massively Parallel Computation (Fairfax, VA, 1988) 4347. [12] M. P. do Carmo, Differential Geometry of Curves and Surfaces (Prentice Hall, New York, 1976). [13] T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, Introduction to Algorithms (The MIT Press, Cambridge, MA, 2001). [14] B. DasGupta, H. T. Siegelmann, and E. D. Sontag, On the Intractability of Loading Neural Networks, in: V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, eds., Theoretical Advances in Neural Computation and Learning (Kluwer, Dordrecht, The Netherlands, 1994) 357389. [15] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness (Freeman, San Francisco, 1979). [16] B. Hammer, Some complexity results for perceptron networks, in: L. Niklasson, M. Bod n, T. and Ziemke, eds., ICANN98 (Springer, Berlin, New York, 1998) 639644. e [17] B. Hammer, Training a sigmoidal network is difcult, in: M. Verleysen, ed., ESANN98 (D-Facto publications, Brussels, 1998) 255260. [18] K.-U. H ffgen, H.-U. Simon, and K. S. Van Horn, Robust trainability of single o neurons, Journal of Computer and System Sciences 50(1) (1995) 114125. [19] L. K. Jones, The computational intractability of training sigmoidal neural networks, IEEE Transactions on Information Theory 43(1) (1997) 167173. [20] J. S. Judd, Neural network design and the complexity of learning (MIT Press, Cambridge, MA, 1990). [21] V. Kann, S. Khanna, J. Lagergren, and A. Panconesi, On the hardness of approximating max-k-cut and its dual, Technical Report CJTCS-1997-2, Chicago Journal of Theoretical Computer Science 2 (1997).


[22] R. Lippmann, An introduction to computing with neural nets, IEEE Acoustics, Speech, and Signal Processing Magazine 4(2) (1987) 422. [23] C. Lund and M. Yannakakis, On the hardness of approximate minimization problems, Journal of the ACM 41(5) (1994) 960981. [24] W. Maass, G. Schnitger, and E. D. Sontag, A comparison of the computational power of sigmoid versus boolean threshold circuits, in: V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky , eds., Theoretical Advances in Neural Computation and Learning (Kluwer, Dordrecht, The Netherlands, 1994) 127151. [25] A. Macintyre and E. D. Sontag, Finiteness results for sigmoidal neural networks, in: Proceedings of the 25th ACM Symposium on the Theory of Computing (San Diego, 1993) 325334. [26] N. Megiddo, On the complexity of polyhedral separability, Discrete Computational Geometry 3 (1988) 325337. [27] C. H. Papadimitriou and M. Yannakakis. Optimization, Approximation and Complexity Classes, Journal of Computer & System Sciences 43 (1991) 425440. [28] C. C. Pinter, Complexity of network training for classes of neural networks, in: K. P. Jantke, T. Shinohara T. and Zeugmann, eds., ALT95 (Springer, Berlin, New York, 1995) 215227. [29] R. D. Reed and R. J. Marks, Neural smithing (MIT Press, Cambridge, MA, 1999). [30] M. Schmitt, Komplexitat neuronaler Lernprobleme (Peter Lang, Bern, 1996). a [31] J. Sim` , Back-propagation is not efcient, Neural Networks 9(6) (1996) 10171023. [32] E. D. Sontag, Feedforward nets for interpolation and classication, Journal of Computer and System Sciences 45 (1992) 2048. [33] M. Spivak, A comprehensive introduction to differential geometry, vol.15 (Publish or Perish, Boston, Mass., 19701975). [34] M. Vidyasagar, A theory of learning and generalization (Springer, Berlin, New York, 1997). [35] V. H. Vu, On the infeasibility of training with small squared errors, in: M. I. Jordan, M. J. Kearns, and S. A. Solla, eds., NIPS 10 (MIT Press, Cambridge, MA, 1998) 371377. [36] P. Werbos, The roots of backpropagation (Wiley, New York, 1994).


APPENDIX

Proof of Theorem 8:

The proof is via Corollary 7 by giving a reduction from the MAX- -cut problem, with and , that satisfies the properties (i) and (ii). By Theorem 3 we may assume that for the proof of .

for all vectors where for , ..., and for , ..., , where the points and are constructed as follows: Choose points in each set , denote the points by , , . . . and the entire set of points by . The points are chosen such that the following property holds: any given different points in lie on one hyperplane if and only if they are contained in one . Such points exist as shown in [26]. For dene by and by for some small value which is chosen such that the following property holds: if one hyperplane in separates at least pairs , these pairs coincide with the pairs corresponding to the points in some , and the separating hyperplane nearly coincides with the hyperplane through . It is shown in [26] that an appropriate can be computed depending on the points . The role of the points and is to enforce that in any network classifying these points correctly the neurons in the rst hidden layer nearly coincide with the hyperplanes through . As a consequence, the points are mapped to the entire set in the rst hidden layer by such a network and therefore determine the function which the remainder of the network computes. is the unit vector (for ). The corresponding examples for each are . is a vector with at positions and from left and otherwise. The corresponding example is .
 (  w v ( I3&Xw v $ 7 7 3 3 r3  2#  7  7    t S f 7` $5uf3  3 VU S   $ f ix b f ` 7  r  b w u7 v f $ f 3 3 f  it w uu  v f 3 f w v  f x  b w v '

We rst establish property (i) as required by Theorem 6. Given an optimum solution of the MAX- -cut problem, we choose the threshold of the neuron in the rst 43
 b

w v  3  7 3 $ 53  3 3  5w3  53  56 w v 3 $ 533  3  3 7 w v t  5"3  53  5i d t 

`n

$ifp f

$  5 y

$ 2

fx  b 3  t 

The set of special points


e e h f 9x  b $ ifx  r  b b S $t@ t 3 VU S r f  f 3  $  f ift3 f3 VU S r S  $ f 5t3  f3 VU S r S  $ f  v  iy'r

(together with their labeling) are the points:

7 xb

The reduction is as follows. An instance graph problem is mapped to the following set of examples in :
7  ( f x  U b x

of the MAX- -cut with

$r$ifxa@uphg f 7 f x  'x  'w v u b P ( b P x P 7 2 

w iu# v f 3

f gx  b

v f 3  u#Uy

f 3  v u#y

b 7  a f

x  b

otherwise

Assume conversely that an optimum solution of the loading problem is given. Since we have constructed a solution without errors on , every optimum solution or solution with relative error smaller than classies correctly. After changing the thresholds if necessary, we can assume that no point in the training set leads to an activation of exactly for one of the neurons. It is sufcient to increase each threshold by , being the maximum negative activation of the (nite set of) different inputs to this neuron. hyperplanes dened by where is Consider the the output of the neuron in the rst hidden layer. Because the points and are classied differently for each , the points and lie on different sides of at least one of these hyperplanes. Hence the hyperplanes nearly coincide used in the construction. The points with the hyperplanes through are mapped to the entire set by . Since H H for we can assume, after a standard weight change if necessary, that the point is mapped to by all neurons in the rst hidden layer. Because of the classication of the points the remaining part of the network necessarily computes the logical function AND, that is, holds for every and the network function . Enlarging the respective weight in any neuron in the rst hidden layer if necessary, such that the sum of this weight and the threshold is at least , we can obtain a solution of at least the same quality where the point is classied correctly for any . Now we dene a solution of the MAX- -cut problem by setting the cut where denotes the function computed by the neuron in the rst hidden layer as before. Note that for every at least one neuron with activation exists. Assume an edge is monochromatic. Consequently, the output of at least one neuron, say hidden neuron , is zero for both and . Because of the classication of the origin the threshold of this neuron is positive, i.e. has the output for neuron , too, and is classied incorrectly. Hence, property (i) is established. In order to establish property (ii) of Theorem 6 dene as follows. Given a socorrectly we have already seen lution of the loading problem which classies how to compute in polynomial time, by changing the signs of some weights if nec44
Given a solution of the loading problem which classifies the special points correctly, we have already seen how to compute in polynomial time, by changing the signs of some weights if necessary, an equivalent solution such that the origin is mapped to the same output by all neurons in the first hidden layer. A node is contained in the i-th cut if its point is mapped to the opposite output by the i-th neuron in the first hidden layer but not by the neurons 1, ..., i-1. All nodes whose points are mapped to the origin's output by all neurons in the first hidden layer are contained in the first cut. Since the special points are correctly classified, the network function is a conjunction (AND) of the outputs of the hidden neurons in the first hidden layer. Hence every monochromatic edge for which both endpoints are correctly classified yields, because of the classification of the origin, an associated point which is classified incorrectly.
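A schematic version of this cut-extraction rule, under the same illustrative node encoding as in the previous sketch (nodes as unit vectors), might look as follows; the tie-breaking rule and the helper names are ours, not taken from the paper.

def threshold_outputs(units, x):
    # outputs of the first-hidden-layer threshold units on input x
    return [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for (w, b) in units]

def cut_from_network(units, n_nodes):
    # assign node v to the first cut class whose hidden unit responds to e_v
    # differently than to the origin; remaining nodes go to the first cut
    origin_out = threshold_outputs(units, [0.0] * n_nodes)
    cut = []
    for v in range(n_nodes):
        e_v = [1.0 if j == v else 0.0 for j in range(n_nodes)]
        out = threshold_outputs(units, e_v)
        differing = [i for i, (a, b) in enumerate(zip(out, origin_out)) if a != b]
        cut.append(differing[0] if differing else 0)
    return cut

def count_monochromatic(edges, cut):
    # quality of the derived cut: edges with both endpoints in the same class
    return sum(1 for (u, v) in edges if cut[u] == cut[v])

Property (ii) then relates the number of monochromatic edges of the derived cut to the number of misclassified points of the given loading solution.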
Both algorithms are polynomial time since k is a constant.
Proof of Theorem 12:

The proof is via Theorem 6. An instance of the MAX-k-cut problem is mapped to a set of examples which consists of a set of special points together with their labeling, while the remaining points are the same vectors with the same labeling as in the sigmoidal case (see the proof of Theorem 9).
Again we want to see what a classification looks like. We consider the points for which the respective equality holds. The weights are denoted as in the proof of Theorem 9. We can assume that the relevant weights are nonzero if the training set is loaded correctly. If we are only interested in the geometric form of the output of the network, we can assume, by an argument similar to the one presented in Step (2) of Theorem 9, that the weights are normalized in the same way.
Fig. B.1. Points which exclude Case 3; the black points are to be mapped to one output value, the white points to the other. Some possible separating lines are also depicted.
was presented in Step (2) of Theorem 9, that . Considering the four values , , , and the following cases can be found for the curve in the plane spanned by and . Note that because of the equality lin lin we can perform similar weight changes as in the sigmoidal case without changing the mapping. To be more precise, one can rst substitute the activation function lin by the activation function lin with lin lin , perform exactly the same weight changes as in the sigmoidal case since lin possesses the same symmetry as , and substitute lin by lin, afterwards. Case 1: All values are or all values are . Then there would exist misclassied points in . Case 2: Exactly one value is positive. We can assume that by an argument similar to Case (2) in Step (2) of Theorem 9. If and are parallel then the positive region is convex and separated from the negative region by at most two lines with normal vector parallel to . If and are linearly independent then the positive region is separated from the negative region by the lines dened by , , and . The positive region consists of the intersection of the three halfspaces dened by these lines. Hence the positive region is convex and separated from the negative region by a continuous curve with at most three linear pieces with normal vectors parallel to , , or a convex combination of and , respectively. Case 3: Exactly two values are positive. We can assume that and are positive and all other values are negative by an argument similar to Case (3) in Step (2) of Theorem 9. If and are linearly dependent then the positive region is separated from the negative region by up to three parallel lines. If and are linearly independent then one obtains the separating lines , , and . That means that the positive region is separated from the negative region by a curve consisting of three linear pieces. Two of them have a normal vector parallel to . Denoting by , , and , respectively, the halfspaces dened by the points with the weakened conditions , , or , respectively, the positive region lies in the set or depending on the sign of . 46
Case 4: Three values are positive and one value is negative. This case is dual to Case 2; we obtain the possible forms by exchanging the output sign, i.e., the positive and the negative region.
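The symmetry referred to above can be made explicit. Assuming the standard definition of the semilinear activation (this definition is our assumption; the paper's formula is not reproduced here), the relevant identities are:

\[
\mathrm{lin}(x) \;=\;
\begin{cases}
0 & \text{if } x \le 0,\\
x & \text{if } 0 < x < 1,\\
1 & \text{if } x \ge 1,
\end{cases}
\qquad
\mathrm{lin}(x) \;=\; 1 - \mathrm{lin}(1-x) \quad \text{for all } x \in \mathbb{R}.
\]

Consequently, the shifted function $\widetilde{\mathrm{lin}}(x) := \mathrm{lin}(x + \tfrac{1}{2})$ satisfies $\widetilde{\mathrm{lin}}(-x) = 1 - \widetilde{\mathrm{lin}}(x)$, which is the same point symmetry as that of the sigmoid $1/(1+e^{-x})$, so the weight changes of the sigmoidal case carry over.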
Next we show that only Case 2 can take place if the training set is classified correctly.
Obviously, Case 1 cannot take place. Case 4 is excluded by the points which are nonvanishing in the respective extra dimensions: we would need three separating planes whose normal vectors have prescribed signs in order to separate the positive points, but this cannot be the case if one of the vectors is a convex combination of the other two. Case 3 can be excluded by the points with nonvanishing coefficients in the last but one and last but two components (see Fig. B.1). In Case 3 the points are separated by a curve consisting of three linear pieces, two of which are parallel. One line must separate two of the depicted pairs, without loss of generality the first two pairs. This line cannot separate another pair or the remaining point because of where the line through the first two pairs intersects the axes. Since a parallel line cannot separate both of the other pairs, the two remaining pieces must separate one pair each, the latter piece being parallel to the first line. But since this pair lies on the same side of the second line as the first pairs, the positive part would have to be contained in the intersection of the respective halfspaces, a contradiction.
Hence every classification which maps the training set correctly is of Case 2.
Now we show that the correspondence demanded in property (i) of Theorem 6 holds. An optimum solution of the MAX-k-cut problem leads to a solution of the loading problem in which the weight associated with a node is set to one value if the node is in the first cut and to another value otherwise, and analogously for the second cut.

3  

wr 3

7 wF

3 $ P P q 3 P 3 P 3 f 3 f 3 f tYl5eWwyQwttu3

3 $ P P q 3 P 3 P 3 f 3 f 3 f tYt"yWyeW"ttt3

P W

7 y

P W

7 z

f g7

3   v
F

3 53  A  7

3 F 7 3  6B

7 4

7  t

7  "F

Fig. B.2. Classification in Case 2. The black points are to be mapped to one output value, the white points to the other. Possible separating lines are also depicted.
With these weights, at most the points corresponding to monochromatic edges are misclassified.
Assume conversely that we are given an optimum solution of the loading problem. This solution classifies the special points correctly; hence the classification is of Case 2. Without loss of generality, assume that the weights are normalized appropriately (computing the corresponding weight changes in polynomial time, if necessary) and that all special points are correctly classified (decrease the weight in every neuron, if necessary, such that the sum of the weight and the respective threshold has the required sign). Define a solution of the MAX-k-cut problem such that exactly those nodes are in the first cut for which the respective weight condition holds. Now it needs to be shown that any edge whose associated point is classified correctly is bichromatic. Assume for the sake of contradiction that this is not the case. Then we find, in the two relevant dimensions, the situation depicted in Fig. B.2, where the positive region is convex and either separated from the negative region by two parallel lines with a common normal vector or by three lines whose normal vectors are the two weight vectors and a convex combination of them. A line separating one of the respective pairs necessarily has a normal vector with two different signs and a negative component in the respective coordinate. If the normal vector had two equal signs, the separating lines could not contain two lines with these properties. Finally, property (ii) of Theorem 6 can be shown: given a weight setting such that the special points are classified correctly, we can assume that Case 2 takes place. Define a solution of the MAX-k-cut problem where a node is in the first cut if and only if the respective weight condition holds. If an edge is monochromatic, then one of its associated points is misclassified, which can be shown using the same argument as before.
Proof of Theorem 13:

The proof is analogous to Theorem 6. Because of (i), we obtain an inequality relating the optimum value of an instance of the MAX-k-cut problem and the optimum value of the corresponding instance of the loading problem.
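For orientation, we recall the standard notion of relative error for a maximization problem; we assume that this is the notion used here, and the symbols opt and val are ours:

\[
\mathrm{relerr}(A) \;=\; \frac{\mathrm{opt} - \mathrm{val}(A)}{\mathrm{opt}},
\]

so any approximate solution $A$ with relative error smaller than $\varepsilon$ satisfies $\mathrm{val}(A) > (1-\varepsilon)\,\mathrm{opt}$.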
Hence the constant in the statement can be chosen accordingly. Any approximate solution of the loading problem with relative error smaller than this constant classifies at least one point of each set of the special points correctly, because otherwise the above inequality would be violated: it cannot hold due to the definition of the relative error, since opt is suitably bounded. If at least one point of each set of the special points is classified correctly, we obtain because of (ii), denoting the costs of a solution of the loading problem and of the MAX-k-cut problem, respectively, a corresponding bound on the cost of the derived cut. Hence the NP-hardness of approximate loading follows.

Proof of Corollary 14:

A reduction from the MAX-k-cut problem as stated in Theorem 13 would show that approximate loading is NP-hard within the respective relative error. The arguments in the theorems of the previous sections can be transferred to this new setting. Note that the origin is used in Theorems 8, 9, and 12 for two different purposes: regarded as a point of the training set it is used to exclude certain geometrical situations and, additionally, it is used for every edge in the graph to guarantee the correspondence of bichromatic edges and correctly classified points. Making this latter role explicit, we introduced the special points. Note that we constructed explicit weights such that only examples corresponding to monochromatic edges were misclassified. The output activation allows a slight change in every case because it was not exactly at the threshold in the threshold or semilinear case, and of absolute value larger than a positive constant in the sigmoidal case.
Furthermore, no activation was exactly at the threshold in the threshold case even for the other neurons in the hidden layer. Hence the continuity of the network function allows us to find some sufficiently small radius such that the classifications of the respective points do not change if they are substituted by any point contained in the open ball of this radius centered at the respective point. Note that this radius does not depend on the specific instance in any of the cases. Hence, for every modification of the training set such that the perturbed points lie in a small neighborhood of the respective original points (and of the origin, respectively), we can find a solution of the same quality as before.
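Written out for the threshold case, the continuity argument yields a uniform robustness radius; the notation ($w$, $\theta$, $\gamma$) is ours and the display is a sketch, not a statement from the paper:

\[
\exists\, \varepsilon > 0:\quad \operatorname{sign}\bigl(w \cdot x' + \theta\bigr) = \operatorname{sign}\bigl(w \cdot x + \theta\bigr)
\quad\text{for every training point } x \text{ and every } x' \text{ with } \|x'-x\| < \varepsilon,
\]

for each threshold unit with weight vector $w$ and threshold $\theta$. Indeed, with margin $\gamma = \min_x |w \cdot x + \theta| > 0$ over the finite training set, any $\varepsilon < \gamma/\|w\|$ works, since $|w\cdot x' - w\cdot x| \le \|w\|\,\|x'-x\| < \gamma$ by the Cauchy-Schwarz inequality; the claim for the whole network follows by taking the minimum over its finitely many units.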

Next we show that every point in the respective solutions can be substituted by any point in an appropriate set such that the same geometrical situations are excluded:

In Theorem 8 the point associated with a node and the point associated with an edge can each be substituted by a nearby point, with the perturbation chosen independently for each pattern; we can also substitute the origin by any point in a small neighborhood of the origin. Still, the points obtained from the nodes and edges guarantee that the neurons in the first hidden layer nearly coincide with the hyperplanes used in the construction, and the special points determine the remainder of the network to compute the logical function AND. We can substitute the points in Theorem 9 analogously, with the perturbation chosen independently for each vector; the respective case is still excluded with the same argument. We substitute the points (1)-(7) by correspondingly perturbed points.
Hence the points can be substituted by points in appropriate line segments which are to be intersected with the regions in which the classification of the optimum solutions constructed above does not change. Since the number of points and their multiplicities are polynomial, appropriate coefficients of the substituting points can be computed in polynomial time.

The perturbations are chosen independently for each point. Note that some points are substituted by two or three sets, respectively, corresponding to their different roles in excluding the respective cases. For example, some of the substituted points form a triangle enforcing the following: if a positive point is to be separated by a hyperplane from this triangle, then the normal vector of the hyperplane necessarily has prescribed signs. Analogous triangles can be found for the other two positive points, and hence the respective case is excluded. In Theorem 12 the same substitutions can be performed for the corresponding points. The points which exclude Case 3 become suitably perturbed versions of the original points.

Define the corresponding substituting points, with nonzero coefficients at the respective positions, together with their labeling; the parameters can again be chosen independently for each point. The points have the following property:
The normal vector of any line separating the respective positive points from the respective negative points in the relevant dimensions necessarily has prescribed signs.
Hence we can establish properties (i) and (ii) as follows. An optimum solution of an instance of the MAX-k-cut problem gives rise to a solution of the corresponding instance of the loading problem where at most the points corresponding to monochromatic edges are misclassified, provided the perturbation parameters are small enough. Conversely, any optimum solution of the loading problem classifies, for each set of special points, at least one point correctly, and hence the same geometrical situations as before take place. Furthermore, we can assume that for each set at least one designated point is correct; otherwise a weight change (which is computable in polynomial time) would lead to a solution of at least the same quality with all of these points correct. Without loss of generality, the respective normalization holds if we adapt the proofs of Theorems 9 and 12, or the second part of the network computes the logical AND function of its inputs if we adapt the proof of Theorem 8. If we adapt the proof of Theorem 8, we can define the i-th cut via the nodes whose points are mapped to the respective output by the i-th neuron in the first hidden layer but not by the neurons 1, ..., i-1, putting all remaining nodes in the first cut. If we adapt the proofs of Theorems 9 and 12, we can define the first cut via the respective weight condition. At most those edges are monochromatic for which one of the associated points is misclassified, because a single hyperplane cannot separate both points associated with one endpoint from both points associated with the other endpoint in Theorem 8, or because of the sign property of the separating lines in Theorems 9 and 12. Hence property (i) holds. Property (ii) follows in the same way.