Jack J Dongarra and Zizhong Chen (2005)
Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources
University of Tennessee Computer Science Department, Knoxville.
As the desire of scientists to perform ever larger computa- tions drives the size of today’s high performance computers from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often intro- duces a considerable overhead, especially when applications modify a large mount of memory between checkpoints. This paper presents an algorithm-based checkpoint-free fault tolerance approach in which, instead of taking check- points periodically, a coded global consistent state of the critical application data is maintained in memory by modi- fying applications to operate on encoded data. Although the applicability of this approach is not so general as the typi- cal checkpoint/rollback-recovery approach, in parallel linear algebra computations where it usually works, because no periodical checkpoint or rollback-recovery is involved in this approach, partial node failures can often be tolerated with a surprisingly low overhead. We show the practicality of this technique by applying it to the ScaLAPACK matrix-matrix multiplication kernel which is one of the most important kernels for ScaLAPACK library to achieve high performance and scalability. We address the practical numerical issue in this technique by proposing a class of numerically good real number erasure codes based on random matrices. Experimental results demon- strate that the proposed checkpoint-free approach is able to survive process failures with a very low performance over- head.