A Quantitative Comparison Of Three Design Algorithms for Feedforward Neural Networks

Adriana Dumitras and Adrian Traian Murgan

Electronics and Telecommunications Faculty
“POLITEHNICA” University of Bucharest
1- 3, Iuliu Maniu Blvd.,
77206 Bucharest
Romania

 

Abstract: A comparison of three pruning algorithms for feedforward neural networks design is proposed, namely MOBD (Modified Optimal Brain Damage), TOBD (Tridiagonal Optimal Brain Damage) and OBD (Optimal Brain Damage). The application taken into account is nonlinear regression and the evaluation criteria are: mean square error, peak signal- to- noise ratio, total number of parameters in the final neural structure, weight values, number of “multiply + add” operations in the test phase and required test time.

 

Keywords: Feedforward neural network, pruning method, saliency, symmetry

 

Adriana Dumitras Ph.D, currently she is senior assistant professor, at the Applied Electronics Department, "Politehnica" University of Bucharest. She has published 26 scientific papers on topics in the fields of artificial neural networks and computer architectures. She was a recipient of the 1993 TEMPUS grant for carrying out a research period at the Technical University of Denmark.

 

Adrian Traian Murgan Ph.D, Professor and Head of the Applied Electronics Department, "Politehnica" University of Bucharest. He has published more than 100 scientific papers on information theory, neural networks and fuzzy systems. He was a recipient of the "Alexander von Humboldt" Stiftung grant. He was visiting professor at several German universities. He is a member of IEEE .

1. Introduction

As powerful nonlinear signal processing tools, neural networks have been extensively applied in pattern recognition, classification, filtering and estimation, etc. Attempts to optimize the neural network structures have been made, in order to avoid overfitting and improve generalization, to obtain higher convergence speed and less costly implementations.

There are several techniques for optimizing the neural architecture [1, 2]. Basically, they are: empirical methods, methods based on statistical criteria [1, 2, and others], growing (constructive) [7- 13, and others], decreasing (destructive, pruning) [14- 20, and others] and hybrid methods [21- 24]. Reviews of the growing [3- 5,] and decreasing [6] techniques are also available.

Largely discussed during the last few years [6, 14- 20, and others], mainly with applications in time series prediction, pruning methods lead to compact networks, which show good performance as compared to the starting architecture or to other structures of greater size. Though the resulting configuration is sparsely connected, usually this is not symmetrical. In hardware implementations, or even in software simulations, the designer is sometimes deeply frustrated, due to the lack of symmetry: the weight storage, the programs written for digital signal processors, etc., would benefit from the symmetry of the weight matrices. Many times in circuit theory, symmetry and reciprocity conditions are imposed on circuits. It is clear that a symmetrical neural network structure would be useful. This is the reason why, in the following, we shall make a quantitative comparison of pruning algorithms with symmetry constraints, for feedforward neural networks.

2. Theoretical Approach

A survey in pruning algorithms [6] quotes two broad classes of pruning methods: those which estimate the sensitivity of the error function to the removal of an element and secondly, methods which add terms to the objective function that rewards the network for choosing efficient solutions. There is some overlap of these two groups, since the objective functions could include sensitivity terms. The destructive algorithms evaluated in the following, belong to the former group, including also symmetry constraints [25- 27].

2.1 The Optimal Brain Damage Algorithm

In the Optimal Brain Damage Algorithm (OBD) method, the feedforward neural network (FANN) is trained and the weight saliencies are calculated. The weights with the lowest saliencies are eliminated and finally, the network is retrained.

The weight's saliency is defined as the change in the training error when the weight is eliminated and the remaining weights are retrained to the new minimum [17, 18]. Basically, the saliency is approximated by the second derivative of the cost function w.r.t. the weight.

The OBD technique was applied to pruning FANNs in contiguity problems, time series prediction, etc. [17, 18].

2.2 The Modified Optimal Brain Damage Algorithm

[25, 26] proposed, instead of a simple elimination as in standard OBD, a symmetric pruning of the weights. In the Modified Optimal Brain Damage Algorithm (MOBD), at each step, the weight having the lowest saliency is eliminated, together with its symmetric (as position in the FANN) “pair”.

MOBD led to sparse weight matrices, which contained zero values in symmetrical positions. The expense of getting symmetrical networks was an increase in the test error [25, 26].

2.3 The Tridiagonal Optimal Brain Damage Algorithm

Another solution of symmetric pruning in feedforward neural networks, the Tridiagonal Optimal Brain Damage Algorithm (TOBD), was suggested in [26, 27]. Without loss of generality, one assumes that the weight matrix has the weight vectors as columns. After calculating the saliencies, the column containing the minimum saliency values is determined and a Householder (reflection) transform is performed on the weight matrix. The weights are normalised and a retraining step follows. After several steps, symmetrical weight values resulted in a tridiagonal matrix. Details of the algorithm and experimental results have been presented in [26, 27].

3. Experimental Results

3.1 The Neural Network Architecture and Learning Algorithm

Let the problem be nonlinear regression and let the FANN be a M- H- N multilayer perceptron (M input, H hidden and N output nodes), trained with the backpropagation algorithm with momentum. We have chosen M = 7, H = 3, N = 1.

The training data consisted of 2880 samples of the “chirp” signal (Figure 1), fed into the network through a 7- sample window, sliding one step to the right. The desired output value was the next value following the window (i.e. the eighth). The training set was learned in 2000 epochs, with a learning rate of 0.01 and a momentum equal to 0.0001. The mean square learning error after 2000 epochs, normalised to the number of patterns was 0.004456.

The test data (Figure 1) consisted of 2880 patterns, different from the training data.

Figure 1. Training and Test “Chirp” Signals

3.2 Evaluation Criteria

The evaluation criteria taken into account are: mean square error scaled to the number of patterns (SMSE) in both training and test phases, peak signal- to- noise ratio (PSNR), total number of parameters (weight, biases) in the final neural structure, weight values, the number of “multiply + add” operations in the test phase and the required test time.

The SMSE cost function is given by (3), where yj(x ) and dj(x ) are the actual, and respectively, the desired output values, with 1 £ j £ N, for each input pattern 1 £ x £ P. One denotes by ej(x ) the output error (i.e. the difference dj(x ) - yj(x )) in node j for the input pattern x and by w, the FANN parameters.

(3)

The PSNR in decibels is given by (4), where dmax and dmin are the maximum and respectively, the minimum value of the desired signal.

3.3 Results of the Quantitative Evaluation

3.3.1 Learning and Test Errors

The MOBD, TOBD and OBD learning errors, test errors and PSNR after the pruning steps are

(4)

shown in Table 2. The comparison was made after 500, 1,000, 2,000, 3,000 and 4,000 retraining epochs, corresponding to the pruning steps shown in Table 1.

Table 1. Number of Retraining Epochs (NRE) and the Corresponding MOBD, TOBD and OBD Pruning Step
 
NRE
MOBD
TOBD
OBD
 
500
1
1
2
 
1000
2
2
4
Step 
2000
4
3
6
 
3000
6
4
9
 
4000
7
5
10
 

Both MOBD and TOBD have higher test errors as compared to OBD. However, the difference between MOBD and OBD is not significant, while for TOBD further retraining is necessary in order to increase the PSNR.

Table 2. Learning Errors, Test Errors and PSNR for MOBD, TOBD and OBD Algorithms
NE
MOBD
TOBD
OBD
 
LE
TE
PSNR
[dB]
LE
TE
PSNR
[dB]
LE
TE
PSNR
[dB]
500
0.00332
1.05915
67.23
0.00437
1.07333
65.85
0.00332
1.05846
67.65
1000
0.00322
1.06813
67.23
0.00370
1.07341
62.40
0.00322
1.06818
64.95
2000
0.00394
1.06648
64.98
0.00307
1.07342
63.42
0.00325
0.82931
71.06
3000
0.00331
1.07159
64.84
0.00290
1.07341
64.63
0.00315
0.78907
71.60
4000
0.00277
1.06524
66.89
0.00308
1.07343
65.79
0.00339
0.98223
68.29
 

NRE = Number of retraining epochs; LE = Learning error; TE = Test error

Learning curves during retraining for MOBD and OBD are shown in Figures 2 and 3, where the peaks point to the weight elimination steps. No significant differences may be noticed for the two algorithms.

The test errors evaluated with Akaike’s Final Prediction Error (FPE) criterion are lower for all the algorithms taken into account than the errors yielded by the AR(7) and ARMAX(7) models (Table 3).

Table 3. Test Errors Estimated With Akaike's FPE, for MOBD, TOBD and OBD Algorithms, As Compared To AR(7) and ARMAX(7) Models
Method / model
Estimated test error, using FPE
MOBD
0.0028
TOBD
0.0030
OBD
0.0034
Fully connected net 
0.0045
ARMAX(7)
0.0150
3.3.2 Peak Signal- to- Noise Ratio

Peak signal- to- noise ratio decreases as the number of the FANN parameters changes (Tables 2, 4). The highest values are given by the FANN in the OBD algorithm, closely followed by MOBD. However, PSNR evaluation after more than 10,000 epochs yields a higher value as compared to the fully connected network output (i.e. better generalization).

Table 4. The PSNR Values, As the Number of the FANN Parameters (NP) Decreases (Dec. = Decreases, Inc. = Increases)
Method
PSNR 
NP decreases:
MOBD
Dec. with 1.079 dB
1.65 times
TOBD
Dec. with 2.200 dB
3.11 times
OBD
Inc. with 0.290 dB
1.27 times
 

Figure 2. Retraining During MOBD. Peaks Show Weight Elimination Steps
 
 
Figure 3. Retraining During OBD. Peaks Show Weight Elimination Steps
 

 

 

Figure 4. Neural Network Structures Resulted From OBD, MOBD and TOBD
Dotted Lines Represent Eliminated Connections / Nodes

The FANN resulted from OBD may be further simplified, using more pruning steps. The MOBD structure is symmetrical, like the TOBD net, but in the latter case, only 7 of the 12 parameters have to be stored, as they have symmetrical values. The second hidden node was eliminated in all of the three methods.

Table 5.The Number of Weights, Biases and Stored Parameters
in TOBD, MOBD and OBD Algorithms
Algorithm
Weights 
Biases
No. of stored parameters
TOBD
12
3
9
MOBD
14
3
17
OBD
18
4
22
Fully connected network
24
4
28
3.3.4 Weight Values

The input - hidden and hidden - output weight values resulted from MOBD and OBD, are shown in Figures 5 and 6. The comparison was performed for the same number of retraining epochs. The hidden - output weights at each step in TOBD and OBD are shown in Figures 7 and 8. One may notice that there are no significant differences as to the range of values. However, in TOBD symmetrical connections have been preserved in the structure.

Figure 5. Comparison Between Input - Hidden Weight Values Resulted From MOBD and OBD

3.3.5 Number of “Multiply + Add” Operations in the Test Phase

The number of “multiply + add” operations in the test phase (Table 6) decreases 4.66 times for TOBD, 2.47 times for MOBD and 1.90 times for OBD structures, as compared to the number of operations in the fully connected network.

3.3.6 Required Test Time

Theoretical evaluations of the required test time [37, 38] and experimental values on an IBM - PC 486 / 66 Mhz computer are included in Table 6. One denotes by t0 the time required by a simple add or multiply operation and by tt the transfer time for 16- bit data.

Figure 6.Comparison Between Hidden - Output Weight Values Resulted From MOBD and OBD

4. Conclusions and Future Work

We have performed a quantitative comparison of the MOBD, TOBD and OBD algorithms. The first and the second led to symmetrical, sparsely connected networks, while the third led to a non - symmetrical sparsely connected network.

The results show that the structure decreases dramatically in all cases. The lower mean square test error was reached in the OBD case, followed by MOBD and TOBD, but it corresponds to the highest number of parameters preserved in the neural structure.

Table 6. The number of “Multiply + Add” Operations in the Test Phase and the Required Test Time for the FANN Structures Resulted From MOBD, TOBD and OBD Algorithms
 
Neural network structure resulted from:
Number of “multiply + add” operations in the test phase
 
Required test time
   
Theoretical
Experim. [sec]
TOBD
14810
5760 (5 tt + 7 t0)
15
MOBD
27975
5760 (6 tt + 7 t0)
22
OBD
36205
5760 (7 tt + 9 t0)
25
Fully connected FANN
69120
5760 (7 tt + 10 t0)
42
 

Although the range of the weight values is basically the same, in the MOBD and TOBD methods there is no need to store all the final parameters, as some of them have symmetrical values.

Figure 7. Hidden - Output Weight Values Resulted From TOBD
 
Figure 8. Hidden - Output Node Weight Values, After OBD Steps With Retraining

The number of “multiply + add” operations is minimum for the FANN given by the TOBD algorithm, followed by MOBD and OBD. The same hierarchy is maintained for the required test time.

The comparison performed in this paper leads to the conclusion that no symmetry may be introduced in the neural model, without paying the price of increasing the error, i.e. decreasing the PSNR. Better performances are likely to obtain if one refines the approximations in the OBD algorithm. Joining this matter of future research, there is another problem we currently focus on, regarding theoretical boundaries in using sparse weight matrices to approximate the FANN input - output relationship.

REFERENCES

  1. Cichocki, A. and Unbehauen, R., Neural Networks for Optimization and Signal Processing, John Wiley & Sons, UK, 1993.
  2. Hertz, J., Krogh, A. and Palmer, R. G., Introduction To the Theory of Neural Computation, Addison - Wesley Publ. Co, USA, 1991.
  3. Kwok, T.-Y. and Yeung, D.-Y., -Constructive Feedforward Neural Networks for Regression Problems: A Survey, Technical Report HKUST-CS95-43, The Hong Kong University of Science and Technology, Department of Computer Science, September 1995.
  4. Hen Hu, Y., Configuration of FeedForward Multilayer Perceptron Neural Networks, Preprint, University of Wisconsin , Madison, Department of Electrical and Computer Engineering, USA, 1994.
  5. Fiesler, E., Comparative Bibliography of Ontogenic Neural Networks, Proc. of ICANN'94: Intl. Conference on Artificial Neural Networks, Sorrento, Italy, 1994, pp. 793-796.
  6. Reed, R., Pruning Algorithms- A Survey, IEEE Transactions on Neural Networks, Vol. 4, No. 5, September 1993, pp. 740-747.
  7. Fahlman, S.E. and Lebiere, C., The Cascade - Correlation Learning Architecture, Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, August 1991.
  8. Hoehfeld, M. and Fahlman, S.E., Learning With Limited Numerical Precision Using the Cascade - Correlation Algorithm, Technical Report CMU-CS-91-130, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, May 1991.
  9. Hornik, K., Stinchcombe, M. and White. H., Multi - Layered FeedForward Neural Networks Are Universal Approximators, Neural Networks, Vol. 2, Elsevier Science Ltd, USA, 1990, pp. 359-366.
  10. Frattale Mascioli, F.M., Martinelli, G. and Lazzaro, G., Comparison of Constructive Algorithms for Neural Networks, Proc. of ICANN'94: Intl. Conference on Artificial Neural Networks, Sorrento, Italy, 1994, pp. 731-734.
  11. Mézard, M. and Nadal, J. P., Learning in Feedforward Layered Networks: The Tiling Algorithm, Journal of Physics, No. 22, USA, 1989, pp. 2191-2203.
  12. Nahban, T.M. and Zomaya, A.Y., Toward Generating Neural Network Structures for Function Approximation, Neural Networks, Vol. 7, No. 1, Elsevier Science Ltd, USA, 1994, pp.89-99.
  13. Refenes, A.N. and Vithlani, S., Constructive Learning by Specialization, Proc. of the Intl. Conference on Neural Networks, Helsinki, Finland, 1991.
  14. Xue, Q., Hen Hu, Y. and Tompkins, W.J., Structural Simplification of A FeedForward, Multi - layer Perceptron Artificial Neural Network, Proc. of ICASSP'91, Toronto, Canada, 1991.
  15. Cottrell, M., Girard, B., Girard, Y., Mangeas, M. and Muller, C., SSM: A Statistical Stepwise Method for Weight Elimination, Proc. of ICANN'94: Intl. Conference on Artificial Neural Networks, Sorrento, Italy, 1994, pp. 601-604.
  16. Cibias, T., Soulié, F.F., Gallinari, P. and Raudys, S., Variable Selection With Optimal Cell Damage, Proc. of ICANN'94: Intl. Conference on Artificial Neural Networks, Sorrento, Italy, 1994, pp. 327-330.
  17. Gorodkin, J., Hansen, L.K., Svarer, C. and Winther, O., A Quantitative Study of Pruning By Optimal Brain Damage, Preprint, Electronics Institute, Technical University of Denmark, Lyngby, Denmark, 1993.
  18. Hassibi, B. and Stork, D.G., Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, in S.J. Hanson et al (Eds.) NIPS 5, San Matteo, CA, USA, Morgan Kaufmann Publ. Co, 1993, p.164.
  19. Kung, S.Y. and Hen Hu, Y., A Frobenius Approximation Reduction Method (FARM) for Determining Optimal Number of Hidden Units, Proc. of the IEEE Intl. Conference on Neural Networks, Seattle, WA, USA, July 1991, pp.163-168.
  20. Thodberg, H.H., Improving Generalization of Neural Networks Through Pruning, Intl. Journal of Neural Systems, USA, Vol. 1, No. 4, 1991, pp. 317 - 326.
  21. Chen, Y. Q., Thomas, D.W. and Nixon, M.S., Generating Shrinking Algorithm for Learning Arbitrary Classification, Neural Networks, Vol. 7, No. 9, Elsevier Science Ltd, USA, 1994, pp.1477-1489.
  22. Dumitras, A., Lazarescu, V. and Murgan, A.T., A Growing - Decreasing Method for Designing Neural Filters, in I. Pitas (Ed.) Proc. of the 1995 IEEE Workshop on Nonlinear Signal and Image Processing, Vol. 2, Neos Marmaras, Greece, 1995, pp. 579-582.
  23. Hirose, Y., Yamashita, K. and Hijiya, S., Back-propagation Algorithm Which Varies the Number of Hidden Units, Neural Networks, No. 4, USA, 1991, pp. 61-66.
  24. Hansen, L.K. and Pedersen, M. W., Controlled Growth of Cascade Correlation Nets, ICANN'94: Intl. Conference on Artificial Neural Networks, Sorrento, Italy, 1994, pp. 797-780.
  25. Dumitras, A. et al, MOBD: A Neural Network Pruning Algorithm with Symmetry Constraints, ICSP'96: Intl. Conference on Signal Processing, Beijing, China, October 1996.
  26. Dumitras, A., Artificial Neural Network Structures for Digital Signal Processing, Ph.D Thesis, Electronics and Telecommunications Department, “Politehnica” University of Bucharest, Romania, November 1996.
  27. Dumitras, A. et al, in preparation.