Adriana Dumitras and Adrian Traian Murgan
Electronics and Telecommunications Faculty “POLITEHNICA” University of Bucharest 1- 3, Iuliu Maniu Blvd., 77206 Bucharest Romania
Abstract: A comparison of three pruning algorithms for feedforward neural networks design is proposed, namely MOBD (Modified Optimal Brain Damage), TOBD (Tridiagonal Optimal Brain Damage) and OBD (Optimal Brain Damage). The application taken into account is nonlinear regression and the evaluation criteria are: mean square error, peak signal- to- noise ratio, total number of parameters in the final neural structure, weight values, number of “multiply + add” operations in the test phase and required test time.
Keywords: Feedforward neural network, pruning method, saliency, symmetry
Adriana Dumitras Ph.D, currently she is senior assistant professor, at the Applied Electronics Department, "Politehnica" University of Bucharest. She has published 26 scientific papers on topics in the fields of artificial neural networks and computer architectures. She was a recipient of the 1993 TEMPUS grant for carrying out a research period at the Technical University of Denmark.
Adrian Traian Murgan Ph.D, Professor and Head of the Applied Electronics Department, "Politehnica" University of Bucharest. He has published more than 100 scientific papers on information theory, neural networks and fuzzy systems. He was a recipient of the "Alexander von Humboldt" Stiftung grant. He was visiting professor at several German universities. He is a member of IEEE .
1. Introduction
As powerful nonlinear signal processing tools, neural networks have been extensively applied in pattern recognition, classification, filtering and estimation, etc. Attempts to optimize the neural network structures have been made, in order to avoid overfitting and improve generalization, to obtain higher convergence speed and less costly implementations.
There are several techniques for optimizing the neural architecture [1, 2]. Basically, they are: empirical methods, methods based on statistical criteria [1, 2, and others], growing (constructive) [7- 13, and others], decreasing (destructive, pruning) [14- 20, and others] and hybrid methods [21- 24]. Reviews of the growing [3- 5,] and decreasing [6] techniques are also available.
Largely discussed during the last few years [6, 14- 20, and others], mainly with applications in time series prediction, pruning methods lead to compact networks, which show good performance as compared to the starting architecture or to other structures of greater size. Though the resulting configuration is sparsely connected, usually this is not symmetrical. In hardware implementations, or even in software simulations, the designer is sometimes deeply frustrated, due to the lack of symmetry: the weight storage, the programs written for digital signal processors, etc., would benefit from the symmetry of the weight matrices. Many times in circuit theory, symmetry and reciprocity conditions are imposed on circuits. It is clear that a symmetrical neural network structure would be useful. This is the reason why, in the following, we shall make a quantitative comparison of pruning algorithms with symmetry constraints, for feedforward neural networks.
2. Theoretical Approach
A survey in pruning algorithms [6] quotes two broad classes of pruning methods: those which estimate the sensitivity of the error function to the removal of an element and secondly, methods which add terms to the objective function that rewards the network for choosing efficient solutions. There is some overlap of these two groups, since the objective functions could include sensitivity terms. The destructive algorithms evaluated in the following, belong to the former group, including also symmetry constraints [25- 27].
2.1 The Optimal Brain Damage Algorithm
In the Optimal Brain Damage Algorithm (OBD) method, the feedforward neural network (FANN) is trained and the weight saliencies are calculated. The weights with the lowest saliencies are eliminated and finally, the network is retrained.
The weight's saliency is defined as the change in the training error when the weight is eliminated and the remaining weights are retrained to the new minimum [17, 18]. Basically, the saliency is approximated by the second derivative of the cost function w.r.t. the weight.
The OBD technique was applied to pruning FANNs in contiguity problems, time series prediction, etc. [17, 18].
2.2 The Modified Optimal Brain Damage Algorithm
[25, 26] proposed, instead of a simple elimination as in standard OBD, a symmetric pruning of the weights. In the Modified Optimal Brain Damage Algorithm (MOBD), at each step, the weight having the lowest saliency is eliminated, together with its symmetric (as position in the FANN) “pair”.
MOBD led to sparse weight matrices, which contained zero values in symmetrical positions. The expense of getting symmetrical networks was an increase in the test error [25, 26].
2.3 The Tridiagonal Optimal Brain Damage Algorithm
Another solution of symmetric pruning in feedforward neural networks, the Tridiagonal Optimal Brain Damage Algorithm (TOBD), was suggested in [26, 27]. Without loss of generality, one assumes that the weight matrix has the weight vectors as columns. After calculating the saliencies, the column containing the minimum saliency values is determined and a Householder (reflection) transform is performed on the weight matrix. The weights are normalised and a retraining step follows. After several steps, symmetrical weight values resulted in a tridiagonal matrix. Details of the algorithm and experimental results have been presented in [26, 27].
3. Experimental Results
3.1 The Neural Network Architecture and Learning Algorithm
Let the problem be nonlinear regression and let the FANN be a M- H- N multilayer perceptron (M input, H hidden and N output nodes), trained with the backpropagation algorithm with momentum. We have chosen M = 7, H = 3, N = 1.
The training data consisted of 2880 samples of the “chirp” signal (Figure 1), fed into the network through a 7- sample window, sliding one step to the right. The desired output value was the next value following the window (i.e. the eighth). The training set was learned in 2000 epochs, with a learning rate of 0.01 and a momentum equal to 0.0001. The mean square learning error after 2000 epochs, normalised to the number of patterns was 0.004456.
The test data (Figure 1) consisted of 2880 patterns, different from the training data.

3.2 Evaluation Criteria
The evaluation criteria taken into account are: mean square error scaled to the number of patterns (SMSE) in both training and test phases, peak signal- to- noise ratio (PSNR), total number of parameters (weight, biases) in the final neural structure, weight values, the number of “multiply + add” operations in the test phase and the required test time.
The SMSE cost function is given by (3), where yj(x ) and dj(x ) are the actual, and respectively, the desired output values, with 1 £ j £ N, for each input pattern 1 £ x £ P. One denotes by ej(x ) the output error (i.e. the difference dj(x ) - yj(x )) in node j for the input pattern x and by w, the FANN parameters.

(3)
The PSNR in decibels is given by (4), where dmax and dmin are the maximum and respectively, the minimum value of the desired signal.
3.3 Results of the Quantitative Evaluation
3.3.1 Learning and Test Errors
The MOBD, TOBD and OBD learning errors, test errors and PSNR after the pruning steps are

(4)
shown in Table 2. The comparison was made after 500, 1,000, 2,000, 3,000 and 4,000 retraining epochs, corresponding to the pruning steps shown in Table 1.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Both MOBD and TOBD have higher test errors as compared to OBD. However, the difference between MOBD and OBD is not significant, while for TOBD further retraining is necessary in order to increase the PSNR.
|
|
|
|
|
|||||||
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
NRE = Number of retraining epochs; LE = Learning error; TE = Test error
Learning curves during retraining for MOBD and OBD are shown in Figures 2 and 3, where the peaks point to the weight elimination steps. No significant differences may be noticed for the two algorithms.
The test errors evaluated with Akaike’s Final Prediction Error (FPE) criterion are lower for all the algorithms taken into account than the errors yielded by the AR(7) and ARMAX(7) models (Table 3).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Peak signal- to- noise ratio decreases as the number of the FANN parameters changes (Tables 2, 4). The highest values are given by the FANN in the OBD algorithm, closely followed by MOBD. However, PSNR evaluation after more than 10,000 epochs yields a higher value as compared to the fully connected network output (i.e. better generalization).
|
|
|
|
|
|
Dec. with 1.079 dB |
|
|
|
Dec. with 2.200 dB |
|
|
|
Inc. with 0.290 dB |
|



The FANN resulted from OBD may be further simplified, using more pruning steps. The MOBD structure is symmetrical, like the TOBD net, but in the latter case, only 7 of the 12 parameters have to be stored, as they have symmetrical values. The second hidden node was eliminated in all of the three methods.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The input - hidden and hidden - output weight values resulted from MOBD and OBD, are shown in Figures 5 and 6. The comparison was performed for the same number of retraining epochs. The hidden - output weights at each step in TOBD and OBD are shown in Figures 7 and 8. One may notice that there are no significant differences as to the range of values. However, in TOBD symmetrical connections have been preserved in the structure.

3.3.5 Number of “Multiply + Add” Operations in the Test Phase
The number of “multiply + add” operations in the test phase (Table 6) decreases 4.66 times for TOBD, 2.47 times for MOBD and 1.90 times for OBD structures, as compared to the number of operations in the fully connected network.
3.3.6 Required Test Time
Theoretical evaluations of the required test time [37, 38] and experimental values on an IBM - PC 486 / 66 Mhz computer are included in Table 6. One denotes by t0 the time required by a simple add or multiply operation and by tt the transfer time for 16- bit data.

4. Conclusions and Future Work
We have performed a quantitative comparison of the MOBD, TOBD and OBD algorithms. The first and the second led to symmetrical, sparsely connected networks, while the third led to a non - symmetrical sparsely connected network.
The results show that the structure decreases dramatically in all cases. The lower mean square test error was reached in the OBD case, followed by MOBD and TOBD, but it corresponds to the highest number of parameters preserved in the neural structure.
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Although the range of the weight values is basically the same, in the MOBD and TOBD methods there is no need to store all the final parameters, as some of them have symmetrical values.


The number of “multiply + add” operations is minimum for the FANN given by the TOBD algorithm, followed by MOBD and OBD. The same hierarchy is maintained for the required test time.
The comparison performed in this paper leads to the conclusion that no symmetry may be introduced in the neural model, without paying the price of increasing the error, i.e. decreasing the PSNR. Better performances are likely to obtain if one refines the approximations in the OBD algorithm. Joining this matter of future research, there is another problem we currently focus on, regarding theoretical boundaries in using sparse weight matrices to approximate the FANN input - output relationship.
REFERENCES