Monday , May 29 2023

Fault Tolerance for Conjugate Gradient Solver Based on FT-MPI

Weizhe ZHANG, Hui HE
Harbin Institution of Technology
92 West Dazhi, Harbin, 150001, China,

Abstract: Grid computing is characterized by high speed, large scale, large task quantity, and long cycles. Such characteristics prevent the waste of large amounts of computing power and time that can be attributed to system errors. Moreover, such features provide the fault tolerance of computing resource nodes in the structural system of grid computing, which has become a key issue in the field. This paper describes the current fault-tolerant message passing interface library, designs a grid computing-based task migration and recovery model, and then identifies the functional architecture of each module of the mode. Further analysis and comparison were conducted on the storage mechanism of the fault-tolerant checkpoint of the model as well as its information-encoding algorithm. Finally, the realization of a Checksum algorithm-based fault-tolerant conjugate gradient solver shows the validity of the theory

Keywords: Fault tolerance; CG solver; FT-MPI; computational grid.

>>Full Text
Weizhe ZHANG, Hui HE, Fault Tolerance for Conjugate Gradient Solver Based on FT-MPI, Studies in Informatics and Control, ISSN 1220-1766, vol. 22 (1), pp. 51-60, 2013.


With the rising expansion and continuous improvement in the complexity of the grid-computing scale, the probability of system component failure, including software and hardware, continues to increase. Moreover, the grid-computing application executes a large volume of tasks that have long life cycles. In the case of non-fault-tolerant measures, system errors cause processes with long life cycles to restart, which consequently degrades the calculations made before the error occurred. Additionally, other error-free processes that are related to the erroneous process may have to restart to return the entire system to its original state, thus causing enormous losses. In some cases, a compromise between long periods of calculation and stable system operation remains a challenge.

Thus, studying the fault tolerance of a grid-computing system is of great significance. Designing an effective fault-tolerant mechanism can effectively prevent data loss. Such a mechanism can also prevent process from restarting after an error occurs. Thus, an effective fault-tolerant mechanism will aid in the construction of a reliable, consistent, continuous, low-cost, and high-end grid-computing system.

Linear equations are widely used in scientific and engineering computations. Such equations have long computing cycles and require large amounts of calculations, which can be solved by using a grid-computing platform. The conjugate gradient (CG) method [13, 16-18], also known as the conjugate vector method, is an iterative method for solving linear equations. This method is characterized by the capability to provide an approximate solution for high-order equations. Such an approximate solution meets accuracy requirements only when the number of iterations is considerably smaller than the order. The fault-tolerance of the CG solver in a grid -computing platform is of great practical value.

This paper primarily investigates fault tolerance based on the fault-tolerant message passing interface (FT-MPI) [2, 3] library and then uses checkpoint technology to realize the CG method. When failure occurs in a node of the grid-computing system during calculation, a checkpoint is reasonably set and checkpoint information is effectively encoded for the backup and migration of computing tasks. This process prevents the failure of a single node from causing the entire system to stop operating normally.

The remainder of this paper is organized as follows: Section 2 describes related studies on the existing FT-MPI library. Section 3 provides a task migration and recovery model for a grid-computing environment. Section 4 presents a Checksum checkpoint-based fault-tolerant CG solver, based on the given model and then discusses the related experimental results. Finally, the conclusion and plans for future work are presented.


  1. FOSTER, I., C. KESSELMAN, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Fransisco, CA. 1999
  2. FAGG, G. E., A. BUKOVSKY, J. J. DONGARRA. Harness and Fault Tolerant MPI. Parallel Computing, 2001.
  3. FAGG, G. E., E. GABRIEL, G. BOSILCA, T. ANGSKUN, Z. CHEN, J. PJESIVAC-GRBOVIC, K. LONDON, J. J. DONGARRA. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems. University of Tennessee, Knoxville, USA, 2004.
  4. STELLNER, G., Cocheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS ’96), Honolulu, Hawaii, 1996.
  5. AGBARIA, A., R. FRIEDMAN, STARFISH: Fault-tolerant Dynamic MPI Programs on Clusters of Workstations. In 8th IEEE International Symposium on High Performance Distributed Computing, 1999.
  6. BOSILCA, G., A. BOUTEILLER, F. CAPPELLO, S. DJILALI, G. F’EDAK, C. GERMAIN, T. H’ERAULT, P. LEMARINIER, O. LODYGENSKY, F. MAGNIETTE, V. N’ERI, A. SELIKHOV. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In Super-Computing, Baltimore USA, Novembre 2002.
  7. GRAHAM, R. L., S.-E. CHOI, D. J. DANIEL, N. N. DESAI, R. G. MINNICH, C. E. RASMUSSEN, L. D. RISINGER, M. W. SUKALSKI. A network-failure-tolerant message-passing system for terascale clusters. In ICS. New York, USA, June. 22-26 2002.
  8. BATCHU, R., J. NEELAMEGAM, Z. CUI, M. BEDDHUA, A. SKJELLUM, Y. DANDASS, M. APTE. Mpi/ftTM: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In Proceedings of the 1st IEEE International Symposium of Cluster Computing and the Grid held in Melbourne, Australia, 2001.
  9. LOUCA, S., N. NEOPHYTOU, A. LACHANAS, P. EVRIPIDOU. Mpi-ft: Portable fault tolerance scheme for mpi. In Parallel Processing Letters, Vol. 10, No. 4, 371-382, World Scientific Publishing Company, 2000.
  10. CHEN, Z., G. E. FAGG. Building Fault Survivable MPI Programs with FTMPI Using Diskless Checkpointing. Computer Science Department, University of Tennessee, USA, 2005.
  11. PLANK, J. S., K. LI, Faster Checkpointing with n+1 Parity. In FTCS, 1994, pp. 288-297.
  12. PLANK, J. S., A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems. Department of Computer Science University of Tennessee, Technical Report CS-96-332.2003.
  13. CHEN, G. L., Parallel Computing – Data Structure and Algorithm, 1999.
  14. DUFF, I., R. GRIMES, J. LEWIS User’s Guide for the Harwell-Boeing Sparse Matrix Collection, October 1992.
  15. DUFF, I., R. GRIMES, J. LEWIS, Sparse Matrix Test Problems. ACM Transactions on Mathematical Software, Volume 15, March 1989, pp. 1-14.
  16. ANDREI, N., Accelerated Conjugate Gradient Algorithm with Modified Secant Condition for Unconstrained Optimization. Studies in Informatics and Control, vol. 18 (3), pp. 211-232, 2009
  17. ANDREI, N., A Hybrid Conjugate Gradient Algorithm for Unconstrained Optimization as a Convex Combination of Hestenes-Stiefel and Dai-Yuan. Studies in Informatics and Control, ISSN 1220-1766, vol. 17 (1) , 2008, pp. 57-70.
  18. ANDREI, N., A Hybrid Conjugate Gradient Algorithm with Modified Secant Condition for Unconstrained Optimization as a Convex Combination of Hestenes-Stiefel and Dai-Yuan Algorithms. Studies in Informatics and Control, ISSN 1220-1766, vol. 17 (4), 2008, pp. 373-392.
  19. JAIRAM, N. K., K. V. KUMAR, N. SATYANARAYANA.Scheduling Tasks on Most Suitable Fault tolerant Resource for Execution in Computational Grid. International Journal of Grid and Distributed Computing, Vol. 5, No. 3, 2012, pp.121-132.
  20. OSMAN, A., A. ANJUM, N. BATOOL, R. MCCLATCHEY. A Fault Tolerant, Dynamic and Low Latency BDII Architecture for Grids. International Journal of Grid and Distributed Computing, Vol. 3, No. 4, pp.1-18, 2010.