
A Constructive Algorithm for Feedforward Neural Networks

Jinhua Xu, Daniel W. C. Ho and Yufan Zheng

Jinhua Xu: Institute of System Science, East China Normal University (jhxu@cs.ecnu.edu.cn)
Daniel W. C. Ho: Department of Mathematics, City University of Hong Kong (madaniel@cityu.edu.hk)
Yufan Zheng: Department of Electrical and Electronics Engineering, The University of Melbourne (y.zheng@ee.mu.oz.au)


Abstract

In this paper, a training and pruning algorithm is proposed for feedforward neural networks based on Jacobian rank deficiency. Redundant nodes are pruned during the training process, so the computational cost of the training algorithm is reduced significantly. Simulations are presented to demonstrate the effectiveness of the proposed approach.

1 Introduction

The feedforward neural network is one of the most popular neural network topologies. One of the difficulties in using a feedforward neural network is determining the optimal number of hidden units before the training process. In general, a network that is too small may not learn the problem well, while an oversized network leads to overfitting and poor generalization performance. Two different approaches for solving this problem have been proposed. The first approach begins with a minimal network and adds hidden units only when they are needed to improve the learning capability of the network [9]. The second approach begins with an oversized network and then prunes the redundant hidden units or connections [12].

Much research has been done on algorithms that start with a minimal network and then dynamically construct the final one. These algorithms include dynamic node creation [1], projection pursuit regression [8], the cascade-correlation algorithm [5], the resource-allocating network [11] and the self-organizing neural network [14]. The aim of these algorithms is to find a network with the simplest possible architecture that is capable of solving a given problem.

Recent interest has also been growing in pruning algorithms that remove hidden units or connections (weights) from an oversized network to improve the generalization capability of a neural network [12]. Most of these methods are post-training pruning. The training problem is formulated as an optimization problem in terms of an error function, defined as the difference between the desired output and the network output. By adding a penalty term to the objective function, the unnecessary connections are driven toward small weights, and the complexity of the network can then be reduced significantly by pruning. In post-training pruning methods, a sensitivity measure is computed after the training process, which indicates the expected increase in the error function when the corresponding units or weights are eliminated from the network [2, 3, 4, 7, 10, 13]. If the sensitivity measure is below a certain threshold, the connection is removed. The resulting network needs to be retrained if the increase in the error function exceeds an acceptable limit. One of the disadvantages of post-training pruning algorithms is their heavy computational cost, since the majority of the training time is spent on networks larger than necessary [9]. In [17], a subset-based training and pruning method is proposed based on Jacobian rank deficiency, where an extra penalty term on the weights is added to the objective function to drive weight pruning. Since only a subset of the weights is trained during the training process, the computational cost of the training algorithm is reduced significantly. However, the redundancy of a weight cannot be estimated from the rank deficiency of the Jacobian matrix, so the redundant weights cannot be removed directly according to the rank deficiency. The convergence of the algorithm cannot be guaranteed, and some extra tuning parameters are required in the training and pruning process.

In this paper, a new training and pruning algorithm is proposed based on the relationship between node redundancy and Jacobian rank deficiency. At each training iteration, QR factorization is applied to the outputs of the nodes in the same hidden layer to identify the redundant nodes based on the correlations among nodes. The redundant nodes are disconnected from the output layer by setting their output weights to zero. The output weights of the remaining nodes are recalculated with the objective of maintaining the original input-output behavior as much as possible. Then, only the weights of the remaining nodes are trained using the Levenberg-Marquardt (LM) algorithm at this iteration. In this way, the redundant nodes are pruned during the training process, and the computational cost of the training algorithm is reduced significantly. Furthermore, in contrast with existing algorithms [17], our algorithm does not use any tuning parameter, which avoids an unnecessarily lengthy tuning phase.

2 Problem formulation

The neural networks considered in this paper are feedforward neural networks. For simplicity, the algorithm is derived for feedforward neural networks with one hidden layer; the activation functions of the hidden layer and the output layer are sigmoid and linear respectively. It is straightforward to extend the algorithm to networks with multiple hidden layers. For a one-hidden-layer feedforward network, the input-output relationship can be represented as

    y = W^2 σ(W^1 x + B^1) + B^2                                (1)

where x ∈ R^n and y ∈ R^p are the network input and output respectively. W^1 ∈ R^{m×n} are the weights from the input layer to the hidden layer, and B^1 ∈ R^m is the bias of the hidden neurons, where m is the number of hidden neurons. W^2 ∈ R^{p×m} are the weights from the hidden layer to the output layer, and B^2 ∈ R^p is the bias of the output neurons. In this paper, B^2 is regarded as the weights from a bias unit with constant output 1 to the output layer, and σ is a differentiable activation function, such as the sigmoid function.

Assume that the hidden neurons in the network are numbered from 1 to m. All the weight parameters related to the ith hidden neuron, including all the incoming and outgoing connections and the bias, are grouped in an S_i-dimensional vector θ_i = [w^1_{i1}, ..., w^1_{in}, b^1_i, w^2_{1i}, ..., w^2_{pi}]^T, i = 1, ..., m. All the biases in the output layer are grouped in θ_{m+1}, that is, θ_{m+1} = [b^2_1, b^2_2, ..., b^2_p]^T. The parameter vector θ can then be represented as

    θ = [θ_1^T, ..., θ_m^T, θ_{m+1}^T]^T ∈ R^S                  (2)

where S = m(n + 1 + p) + p is the number of weight parameters of the network. With batch learning, a set of training patterns {x(t), ȳ(t), t = 1, ..., N} is given, where x(t) ∈ R^n and ȳ(t) ∈ R^p are the input and the desired output of the network respectively. The output of the network is

    y(t, θ) = W^2 σ(W^1 x(t) + B^1) + B^2                       (3)

Define the error

    e(t) := y(t, θ) − ȳ(t)                                      (4)

then the training problem can be formulated as the following optimization problem:

    min_θ V(θ) = (1/2) Σ_{t=1}^{N} e^T(t) e(t)
               = (1/2) Σ_{t=1}^{N} Σ_{i=1}^{p} e_i^2(t)
               = (1/2) E^T E                                    (5)

where E^T = [e^T(1), e^T(2), ..., e^T(N)].
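As a concrete illustration of the notation in (1)-(5), the following numpy sketch (not part of the original paper) computes the network output and the batch error function; the function names and the logistic form assumed for σ are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Assumed logistic sigmoid for the hidden-layer activation.
    return 1.0 / (1.0 + np.exp(-z))

def network_output(x, W1, B1, W2, B2):
    """y(t, theta) = W2 * sigmoid(W1 x + B1) + B2, cf. (1) and (3)."""
    phi = sigmoid(W1 @ x + B1)          # hidden-node outputs, shape (m,)
    return W2 @ phi + B2                # network output, shape (p,)

def batch_error(X, Y, W1, B1, W2, B2):
    """V(theta) = 0.5 * sum_t ||y(t, theta) - ybar(t)||^2, cf. (4)-(5).

    X : (N, n) inputs, Y : (N, p) desired outputs.
    Returns V and the stacked error vector E of length N*p.
    """
    E = np.vstack([network_output(x, W1, B1, W2, B2) - y for x, y in zip(X, Y)])
    return 0.5 * float(np.sum(E * E)), E.ravel()
```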

Let the Jacobian matrix be

    J(θ) = ∂E/∂θ = [ J(1) ]
                   [ J(2) ]
                   [  ...  ]
                   [ J(N) ]                                     (6)

    J(i) = ∂e(i)/∂θ =
        [ ∂e_1(i)/∂θ_1   ∂e_1(i)/∂θ_2   ...   ∂e_1(i)/∂θ_S ]
        [ ∂e_2(i)/∂θ_1   ∂e_2(i)/∂θ_2   ...   ∂e_2(i)/∂θ_S ]
        [      ...             ...       ...        ...     ]
        [ ∂e_p(i)/∂θ_1   ∂e_p(i)/∂θ_2   ...   ∂e_p(i)/∂θ_S ]    (7)

where

    ∂e_i(t)/∂w^1_{i',j} = w^2_{i,i'} φ'_{i'}(t) x_j(t),   i' = 1, ..., m,  j = 1, ..., n     (8)

    ∂e_i(t)/∂b^1_{i'} = w^2_{i,i'} φ'_{i'}(t),            i' = 1, ..., m                     (9)

    ∂e_i(t)/∂b^2_{i'} = δ_{i,i'},                         i' = 1, ..., p                     (10)

    ∂e_i(t)/∂w^2_{i',j} = δ_{i,i'} φ_j(t),                i' = 1, ..., p,  j = 1, ..., m      (11)

Here φ_i(t) = σ(W^1_i x(t) + B^1_i) is the output of the ith hidden node for the tth training pattern, W^1_i and B^1_i are the ith row of W^1 and B^1 respectively, σ' is the derivative of σ, and φ'_i(t) = σ'(W^1_i x(t) + B^1_i).
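The per-node Jacobian blocks can be assembled directly from (8)-(11). The sketch below is an illustration rather than the authors' code: it assumes the logistic sigmoid for σ and the parameter ordering of (2), and the function name jacobian_blocks is hypothetical.

```python
import numpy as np

def jacobian_blocks(X, W1, B1, W2, B2):
    """Analytic Jacobian blocks dE/dtheta_i, cf. (6)-(11).

    Returns a list of m+1 arrays: blocks 0..m-1 have shape (N*p, n+1+p) and
    correspond to theta_i = [w1_i1..w1_in, b1_i, w2_1i..w2_pi]; the last block
    has shape (N*p, p) and corresponds to theta_{m+1} = [b2_1..b2_p].
    Rows are ordered pattern-major, i.e. row t*p + k holds e_k(t).
    """
    N, n = X.shape
    p, m = W2.shape
    Z = X @ W1.T + B1                    # (N, m) pre-activations
    Phi = 1.0 / (1.0 + np.exp(-Z))       # hidden outputs phi_i(t)
    dPhi = Phi * (1.0 - Phi)             # sigmoid derivative phi'_i(t)
    blocks = []
    for i in range(m):
        Ji = np.zeros((N * p, n + 1 + p))
        for t in range(N):
            for k in range(p):
                row = t * p + k
                Ji[row, :n] = W2[k, i] * dPhi[t, i] * X[t]     # eq. (8)
                Ji[row, n] = W2[k, i] * dPhi[t, i]             # eq. (9)
                Ji[row, n + 1 + k] = Phi[t, i]                 # eq. (11)
        blocks.append(Ji)
    Jb = np.zeros((N * p, p))
    for t in range(N):
        for k in range(p):
            Jb[t * p + k, k] = 1.0                             # eq. (10)
    blocks.append(Jb)
    return blocks
```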

The BP algorithm has been widely used for training multilayer perceptron (MLP) networks. However, several drawbacks of the BP algorithm have been observed, and it is known that Newton's method and its modified versions can be applied to overcome these drawbacks. The Hessian matrix is defined as

    H = ∂²V/∂θ² = ∂(J^T E)/∂θ = J^T J + (∂²E^T/∂θ²) E           (12)

Neglecting the second term, the Hessian matrix is approximated by the first term:

    H ≈ J^T J = Σ_{i=1}^{N} J^T(i) J(i)                         (13)

The Gauss-Newton method can be obtained by solving

    H Δθ = −J^T E                                               (14)

If J^T J is invertible, the Gauss-Newton algorithm can be written as

    θ(k + 1) = θ(k) − H^{-1} J^T E                              (15)

Since J^T J may not be invertible, we have the Levenberg-Marquardt (LM) algorithm [6]:

    θ(k + 1) = θ(k) − (H + μI)^{-1} J^T E                       (16)

where μ is a positive number which can be adjusted appropriately. Whenever an update results in an increased V(θ), the update is discarded and μ is multiplied by some factor greater than 1; when an update reduces V(θ), μ is divided by that factor. When μ is large, the algorithm becomes gradient descent (BP); for small μ, the algorithm becomes Gauss-Newton. The LM algorithm can be considered a trust-region modification of the Gauss-Newton algorithm [6]. The LM algorithm significantly outperforms basic backpropagation (BP) and its variants with variable learning rate in terms of training accuracy, convergence properties and overall training time. One disadvantage of the LM algorithm, however, is that the computation and memory requirements within each iteration are high. Notice that the rank deficiency of the Jacobian matrix in (6) plays an important role in the implementation of this algorithm. On the one hand, rank deficiency of the Jacobian matrix makes the Gauss-Newton algorithm (and some other higher-order algorithms) inapplicable; on the other hand, it may indicate that some hidden units in the network are redundant. We shall discuss this in the next section.
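For concreteness, the following is a minimal sketch of one LM trial step with the multiplicative adjustment of μ described above. It is not the authors' implementation; the function name lm_step, the callable arguments and the adjustment factor of 10 are assumptions.

```python
import numpy as np

def lm_step(theta, residuals, jacobian, mu, factor=10.0):
    """One Levenberg-Marquardt trial step on V(theta) = 0.5 * ||E||^2.

    residuals(theta) -> E, shape (N*p,)
    jacobian(theta)  -> J = dE/dtheta, shape (N*p, S)
    Returns the (possibly unchanged) parameters and the adapted mu.
    """
    E = residuals(theta)
    J = jacobian(theta)
    V = 0.5 * E @ E
    # Solve (J^T J + mu I) d = -J^T E, cf. (16)
    H = J.T @ J
    g = J.T @ E
    d = np.linalg.solve(H + mu * np.eye(H.shape[0]), -g)
    theta_new = theta + d
    E_new = residuals(theta_new)
    if 0.5 * E_new @ E_new < V:
        # Successful step: accept it and move toward Gauss-Newton behaviour.
        return theta_new, mu / factor
    # Unsuccessful step: discard it and move toward gradient-descent behaviour.
    return theta, mu * factor
```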

3 Training and pruning algorithm

In this section, the relationship between node redundancy and Jacobian rank deficiency is analyzed, and a new training and pruning algorithm is proposed. The significance of this approach is that the redundant nodes are pruned in the training stage, and hence the computational cost of the training algorithm is greatly reduced.

First, we describe the method used to identify the redundant nodes. At iteration k, the output of the ith hidden node for the jth training pattern is denoted as φ_i(j) = σ(W^1_i x(j) + B^1_i). Define the matrix

    Φ(k) := [φ_1, φ_2, ..., φ_m, φ_{m+1}]                       (17)

where φ_i = [φ_i(1), φ_i(2), ..., φ_i(N)]^T, i = 1, ..., m, and φ_{m+1}(j) = 1 for j = 1, ..., N is the output of the bias unit for all training patterns. The output of the network at iteration k is denoted as Ŷ(k) := [ŷ(1), ŷ(2), ..., ŷ(N)]^T. Let W̃ := [W^2, B^2]; then we have

    Ŷ(k) = Φ(k) W̃^T                                             (18)

This means that each output in Ŷ(k) is a linear combination of the node outputs in Φ(k). Next, QR factorization with column pivoting is applied to the matrix Φ(k) to detect its independent columns, that is,

    Φ(k) P = Q A                                                (19)

where Q is an N × N unitary matrix, A is an upper triangular matrix of the same dimensions as Φ, and P is a permutation matrix with P^T P = P P^T = I, chosen so that the absolute values of the diagonal elements of A are in decreasing order. If there are r independent columns in the matrix Φ(k), then a_ii ≠ 0 for i = 1, ..., r and a_jj = 0 for j = r + 1, ..., min(m + 1, N). Decompose A into block matrix form as

    A = [ A_11  A_12 ]
        [  0     0   ]                                          (20)

where A_11 is an r × r upper triangular matrix. Substituting (19) into (18), and writing W̃P for the correspondingly permuted output weight matrix, we have

    A (W̃P)^T = Q^T Ŷ(k)                                         (21)

Correspondingly, decompose Q and W̃P into block matrix form as

    Q = [Q_1, Q_2],    W̃P = [W^r, W^p]                          (22)

where W^r and W^p are p × r and p × (m + 1 − r) matrices, representing the output weights of the remaining nodes and of the redundant nodes respectively. One can verify that Q_2^T Ŷ(k) = 0. Disconnecting the redundant nodes by setting W^p = 0, the remaining weights W^r can be easily obtained by solving the following linear equation:

    A_11 (W^r)^T = Q_1^T Ŷ(k)                                   (23)
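The node-selection step in (19)-(23) can be sketched with scipy's pivoted QR as follows. This is an illustrative reading of the procedure, not the authors' code: the numerical tolerance used to decide r (the text assumes exact zeros on the diagonal of A) and the function name prune_hidden_nodes are assumptions.

```python
import numpy as np
from scipy.linalg import qr

def prune_hidden_nodes(Phi, Y_hat, tol=1e-8):
    """Identify redundant hidden nodes via QR with column pivoting, cf. (19)-(23).

    Phi   : (N, m+1) matrix of hidden-node outputs (last column = bias unit), eq. (17)
    Y_hat : (N, p) current network outputs, eq. (18)
    Returns the indices of the retained columns of Phi (in pivot order) and the
    readjusted output weights W_r (p x r) that reproduce Y_hat.
    """
    Q, A, perm = qr(Phi, mode='economic', pivoting=True)   # Phi[:, perm] = Q @ A
    diag = np.abs(np.diag(A))
    # r = number of independent columns; a relative tolerance replaces the exact-zero test.
    r = int(np.sum(diag > tol * diag[0]))
    keep = perm[:r]                      # retained (non-redundant) nodes
    A11 = A[:r, :r]
    Q1 = Q[:, :r]
    # Solve A11 * W_r^T = Q1^T * Y_hat, eq. (23); column j of W_r belongs to node keep[j].
    W_r = np.linalg.solve(A11, Q1.T @ Y_hat).T
    return keep, W_r
```

In exact arithmetic the diagonal test reduces to a_jj = 0 as in the text; the relative tolerance simply makes the decision robust in floating point.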

From the above analysis, we know that we can disconnect the m + 1 − r redundant nodes and readjust the output weights of the remaining nodes without any effect on the network's input-output behavior.

Next, we analyze the relationship between node redundancy and Jacobian rank deficiency. Denote the Jacobian matrix in (6) as

    J(θ) = [J̃_1, J̃_2, ..., J̃_{m+1}]                            (24)

where

    J̃_i = ∂E/∂θ_i = [∂E/∂w^1_{i1}, ..., ∂E/∂w^1_{in}, ∂E/∂b^1_i, ∂E/∂w^2_{1i}, ..., ∂E/∂w^2_{pi}]    (25)

for i = 1, ..., m, and

    J̃_{m+1} = ∂E/∂θ_{m+1} = [∂E/∂b^2_1, ..., ∂E/∂b^2_p]         (26)

We have two different cases to investigate.

Case 1: constant-output node. If the ith node is a constant-output node, that is, φ_i(t) = α for all t, where α is a constant, then the ith node is in a saturation region for all training patterns and φ'_i(t) = 0 for all t. From (8)-(9) we have ∂E/∂w^1_{ij} = 0 for j = 1, ..., n and ∂E/∂b^1_i = 0; from (11) and (10), we have ∂E/∂w^2_{ji} = α ∂E/∂b^2_j for j = 1, ..., p. Therefore, all the n + 1 + p columns of J̃_i are dependent on other columns of J, and the rank of J is decreased by n + 1 + p.

Case 2: parallel or anti-parallel nodes. If the ith node is parallel or anti-parallel with the jth node, then φ_i(t) = λ φ_j(t) for all t, where λ is a real constant. Assume that the ith node is redundant and is disconnected, that is, w^2_{k,i} = 0 for k = 1, ..., p. From (8)-(9), ∂E/∂w^1_{ik} = 0 for k = 1, ..., n and ∂E/∂b^1_i = 0; from (11), we have ∂E/∂w^2_{ki} = λ ∂E/∂w^2_{kj} for k = 1, ..., p.

Therefore, the n + 1 + p columns of J̃_i are dependent on other columns of J, and the rank of J is decreased by n + 1 + p. More generally, if the ith node is redundant, that is, φ_i is dependent on the other columns of Φ, i.e., φ_i = Σ_j α_j φ_j with j ∈ {1, 2, ..., m + 1}\{i} and α_j ≠ 0, then all the n + 1 + p columns of J̃_i are dependent on other columns of J. Therefore, the redundant nodes identified by QR factorization have a direct relationship with the rank deficiency of the Jacobian matrix. As a result, we can remove the redundant nodes to reduce both the rank deficiency of the Jacobian matrix and the computational cost.

Rearrange the columns of J as

    J = [J^r, J^p]

where J^r = [J̃_{l_1}, J̃_{l_2}, ..., J̃_{l_r}] is the Jacobian matrix of the remaining nodes and J^p = [J̃_{l_{r+1}}, ..., J̃_{l_{m+1}}] is the Jacobian matrix of the redundant nodes. Since the columns of J^p are dependent on those of J^r, we have

    J^p = J^r L                                                 (27)

for some matrix L of appropriate size with nonzero columns. Correspondingly, the weight vector can be rearranged as θ = [θ^r, θ^p] = [θ_{l_1}, ..., θ_{l_r}, θ_{l_{r+1}}, ..., θ_{l_{m+1}}]. Then (14) can be decomposed as

    (J^r)^T J^r Δθ^r + (J^r)^T J^r L Δθ^p = −(J^r)^T E                      (28)

    L^T (J^r)^T J^r Δθ^r + L^T (J^r)^T J^r L Δθ^p = −L^T (J^r)^T E          (29)

The second system of equations, (29), can be obtained by multiplying both sides of the first system, (28), by L^T; therefore, (29) is dependent on (28). If we choose Δθ^p = 0, which means the weight parameters related to the redundant nodes remain unchanged at this iteration, then we have

    (J^r)^T J^r Δθ^r = −(J^r)^T E                               (30)

Therefore, we can update the weights of the remaining nodes using the LM algorithm as

    θ^r(k + 1) = θ^r(k) − ((J^r)^T J^r + μI)^{-1} (J^r)^T E     (31)

Since (J^r)^T J^r may still not be invertible, the term μI is added to avoid such an ill condition, where μ is a small positive scalar.

Remarks: (1) At each iteration, only the Jacobian matrix of the remaining nodes, J^r, needs to be calculated; J^p and L are used only for the analysis. (2) The sizes of J^r and θ^r are much smaller than those of J and θ, as will be demonstrated in the simulations later, so the computational cost is decreased significantly. Notice that the redundant nodes are disconnected at each iteration and are removed after training.

The differences between our algorithm and the one in [17] are as follows. First, in our algorithm QR factorization is applied to the outputs of the hidden nodes to detect the redundant nodes; in [17], QR factorization is applied to the Jacobian matrix to detect the redundant weights. Second, since there is a direct relationship between node redundancy and Jacobian rank deficiency, no tuning parameter is needed in our algorithm; in [17], the redundancy of a weight cannot be estimated from the Jacobian rank deficiency, so a weight cannot be removed directly according to the rank deficiency. An extra penalty term on the weights is added to the objective function to drive weight pruning, and some tuning parameters are needed to remove the redundant weights. Third, the removal of the redundant nodes has no effect on the network's output, so the training error function decreases monotonically during the training process in our algorithm. In [17], the convergence of the algorithm depends on some tuning parameters; the training error function may increase so much that the training process may diverge.
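Putting the pieces together, the sketch below outlines one iteration of the proposed scheme: detect the redundant nodes from Φ(k), readjust the output weights of the remaining nodes via (23), and apply the reduced LM update (31) to the remaining parameter blocks only. It is a simplified illustration under stated assumptions (hypothetical function name, tolerance-based rank decision, error vector and Jacobian blocks supplied by the caller); in the full procedure the error and J^r would be re-evaluated after the readjusted output weights are written back into θ.

```python
import numpy as np
from scipy.linalg import qr

def train_prune_iteration(Phi, Y_hat, E, J_blocks, theta_blocks, mu, tol=1e-8):
    """One sketched iteration of the training-and-pruning scheme of Section 3.

    Phi          : (N, m+1) hidden-node outputs, last column = bias unit, eq. (17)
    Y_hat        : (N, p) current network outputs, eq. (18)
    E            : (N*p,) stacked error vector (pattern-major ordering)
    J_blocks     : list of m+1 arrays, the per-node Jacobian blocks of (24)-(26)
    theta_blocks : list of m+1 parameter vectors theta_i of eq. (2)
    Returns the kept node indices (pivot order), the readjusted output weights
    W_r of eq. (23), and the updated parameter blocks of the remaining nodes.
    """
    # Identify redundant nodes by pivoted QR on Phi(k), eq. (19)-(20).
    Q, A, perm = qr(Phi, mode='economic', pivoting=True)
    diag = np.abs(np.diag(A))
    r = int(np.sum(diag > tol * diag[0]))
    keep = [int(i) for i in perm[:r]]
    # Readjust the output weights of the remaining nodes, eq. (23).
    A11, Q1 = A[:r, :r], Q[:, :r]
    W_r = np.linalg.solve(A11, Q1.T @ Y_hat).T   # column j belongs to node keep[j]
    # Reduced LM update on the remaining blocks only, eq. (30)-(31).
    # (In the full procedure E and J_blocks would be recomputed after W_r is
    #  written back into theta; omitted here for brevity.)
    Jr = np.hstack([J_blocks[i] for i in keep])
    H = Jr.T @ Jr
    g = Jr.T @ E
    d = np.linalg.solve(H + mu * np.eye(H.shape[0]), -g)
    new_blocks, start = {}, 0
    for i in keep:
        size = theta_blocks[i].size
        new_blocks[i] = theta_blocks[i] + d[start:start + size]
        start += size
    return keep, W_r, new_blocks
```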

4 Simulation results

Example 1: multi-step function approximation [15]. In this example, the training and pruning algorithm is used to train multilayer perceptrons (MLPs) to approximate a multi-step function. The 100 training data are randomly distributed on [−1, 1]. In [15], the BP algorithm is used to train an MLP to approximate the multi-step function, and it is concluded that the MLP captures the global shape of the function but not its transitions (sharp changes). In our experiments, the number of initial hidden nodes is chosen to be 41, the same number as in [15]. The initial weight parameters are random numbers on (−0.5, 0.5). After training, the network is checked on uniformly distributed data on (−1, 1).

First, the LM algorithm is used to train the MLP. After 300 iterations, the approximation result of the MLP is acceptable, as shown in Table 1. Therefore, the poor approximation result of the MLP in [15] is caused by the slow convergence of the BP algorithm. If we increase the number of iterations of the LM algorithm to 376, a local minimum is found, observed as the training error no longer decreasing even when μ = μ_max. The mse for the training data is reduced to 1.0447e-4; however, the mse for the checking data increases to 7.3093e-2. Obviously, overtraining occurs. The approximation result for the checking data is shown in Figure 1(a); it can be seen that overfitting occurred for inputs within (0.4, 0.5) and (0.7, 0.8).

Then our proposed training and pruning algorithm is used to train the MLP. After 290 iterations, a local minimum is reached. As shown in Figure 1(b) and Table 1, the approximation result is much better than that of the LM algorithm. The evolution of the number of hidden nodes is shown in Figure 2. It can be seen that after training, the number of remaining hidden neurons is 17, far fewer than 41.

Table 1: Comparison of the approximation results

    algorithm               training & pruning    LM           LM
    iterations              290                   300          376
    # of hidden neurons     17                    41           41
    mse of training data    4.5203e-4             5.4436e-4    1.0447e-4
    mse of checking data    2.4410e-2             5.5349e-2    7.3093e-2

[Figure 1: The simulation result for Example 1. (a) LM training algorithm; (b) training and pruning algorithm.]

[Figure 2: The plot of the number of remaining hidden neurons during the training process.]

5 Conclusion

A training and pruning algorithm based on the concept of Jacobian deficiency is developed in this paper, which combines the training and pruning of a network into one comprehensive procedure. The goal is to reduce the computational cost while reducing the number of network nodes to overcome overfitting. The algorithm is quite general and can be applied to any feedforward neural network. Notice that the redundant nodes are identified based on the linear dependency among hidden node outputs, which is mostly caused by the saturation characteristics inherent in the sigmoid function; therefore the proposed algorithm is most suitable for MLPs. For feedforward neural networks with other activation functions, an efficient redundant node selection algorithm is under investigation.

Acknowledgement

This work is supported by a grant from RGC (CityU 101103).

References
[1] Ash, T., "Dynamic node creation in backpropagation networks," Connection Science, vol. 1, no. 4, pp. 365-375, 1989.
[2] Chang, Sheng-Jiang, Kwok-Wo Wong and Chi-Sing Leung, "Periodic activation function for fast on-line EKF training and pruning," Electronics Letters, vol. 34, no. 23, pp. 2255-2256, 1998.
[3] Chung, F. L. and T. Lee, "A node pruning algorithm for backpropagation networks," Int. Journal of Neural Systems, vol. 3, no. 3, pp. 301-314, 1992.
[4] Le Cun, Yann, John S. Denker and Sara A. Solla, "Optimal brain damage," Advances in Neural Information Processing (2), D. S. Touretzky (Ed.), San Mateo, CA: Morgan Kaufmann, pp. 598-605, 1990.
[5] Fahlman, S. E. and C. Lebiere, "The cascade-correlation learning architecture," Advances in Neural Information Processing (2), D. S. Touretzky (Ed.), San Mateo, CA: Morgan Kaufmann, pp. 524-532, 1990.
[6] Hagan, Martin T. and Mohammad B. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Trans. on Neural Networks, vol. 5, no. 6, 1994.
[7] Hassibi, Babak and David G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," Advances in Neural Information Processing (5), S. J. Hanson (Ed.), San Mateo, CA: Morgan Kaufmann, pp. 164-171, 1993.
[8] Hwang, J. N., S. R. Lay, M. Maechler, D. Martin and J. Schimert, "Regression modeling in backpropagation and projection pursuit learning," IEEE Trans. on Neural Networks, vol. 5, pp. 324-353, 1994.
[9] Kwok, Tin-Yau and Dit-Yan Yeung, "Constructive algorithms for structure learning in feedforward neural networks for regression problems," IEEE Trans. on Neural Networks, vol. 8, no. 3, pp. 630-645, 1997.
[10] Leung, Chi Sing, Kwok Wo Wong, Pui Fai Sum and Lai Wan Chan, "On-line training and pruning for recursive least square algorithm," Electronics Letters, vol. 32, no. 23, pp. 2152-2153, 1996.
[11] Platt, J., "A resource-allocating network for function interpolation," Neural Computation, vol. 3, pp. 213-225, 1991.
[12] Reed, Russell, "Pruning algorithms - a survey," IEEE Trans. on Neural Networks, vol. 4, no. 5, pp. 740-747, 1993.
[13] Setiono, Rudy, "A penalty-function approach for pruning feedforward neural networks," Neural Computation, vol. 9, pp. 185-204, 1997.
[14] Tenorio, M. F. and W. T. Lee, "Self-organizing network for optimum supervised learning," IEEE Trans. on Neural Networks, vol. 1, pp. 100-110, 1990.
[15] Zhang, Jun, Gilbert G. Walter and Wan Ngai Wayne Lee, "Wavelet neural networks for function learning," IEEE Trans. on Signal Processing, vol. 43, no. 6, pp. 1485-1497, June 1995.
[16] Zhou, G. and J. Si, "Advanced neural network training algorithm with reduced complexity based on Jacobian deficiency," IEEE Trans. on Neural Networks, vol. 9, no. 3, pp. 448-453, 1998.
[17] Zhou, Guian and J. Si, "Subset-based training and pruning of sigmoid neural networks," Neural Networks, vol. 12, pp. 79-89, 1999.
