
Jack J Dongarra and Zizhong Chen (2005)

Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources

University of Tennessee Computer Science Department, Knoxville.

As the desire of scientists to perform ever larger computations drives the size of today’s high performance computers from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead, especially when applications modify a large amount of memory between checkpoints. This paper presents an algorithm-based checkpoint-free fault tolerance approach in which, instead of taking checkpoints periodically, a coded, globally consistent state of the critical application data is maintained in memory by modifying applications to operate on encoded data. This approach is not as generally applicable as the typical checkpoint/rollback-recovery approach. However, in parallel linear algebra computations, where it usually works, partial node failures can often be tolerated with a surprisingly low overhead, because no periodic checkpoint or rollback-recovery is involved. We show the practicality of this technique by applying it to the ScaLAPACK matrix-matrix multiplication kernel, which is one of the most important kernels for the ScaLAPACK library to achieve high performance and scalability. We address the practical numerical issue in this technique by proposing a class of numerically good real number erasure codes based on random matrices. Experimental results demonstrate that the proposed checkpoint-free approach is able to survive process failures with a very low performance overhead.
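
The idea of operating on encoded data can be illustrated with a minimal sketch. It is not the paper's ScaLAPACK implementation: it assumes a single simulated failure, a row-block distribution of A over P hypothetical processes, and one extra "checksum process" holding a randomly weighted sum of the blocks (a one-row stand-in for a random-matrix erasure code). All names (`matmul`, `weighted_sum`, `recover_block`, `weights`) are illustrative assumptions. Because matrix multiplication is linear, the encoded block times B equals the same weighted sum of the result blocks, so a lost result block can be rebuilt without any checkpoint.

```python
import random

def matmul(X, Y):
    """Plain dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def weighted_sum(blocks, weights):
    """Element-wise sum of w_i * block_i over equally shaped blocks."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(w * blk[r][c] for w, blk in zip(weights, blocks))
             for c in range(cols)] for r in range(rows)]

def recover_block(results, checksum_result, weights, failed):
    """Rebuild the failed block from the survivors and the encoded result."""
    rows, cols = len(checksum_result), len(checksum_result[0])
    survivors = [i for i in range(len(results)) if i != failed]
    return [[(checksum_result[r][c]
              - sum(weights[i] * results[i][r][c] for i in survivors))
             / weights[failed]
             for c in range(cols)] for r in range(rows)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
P, rows_per_block, n = 4, 2, 3  # assumed sizes, chosen only for illustration

# Each simulated process owns one row block of A; B is shared by all.
A_blocks = [[[rng.uniform(-1, 1) for _ in range(n)]
             for _ in range(rows_per_block)] for _ in range(P)]
B = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

# Random nonzero weights play the role of the real-number erasure code.
weights = [rng.uniform(0.5, 1.5) for _ in range(P)]

# Encode once, before the computation starts.
A_checksum = weighted_sum(A_blocks, weights)

# Every process, including the checksum process, runs the same kernel.
C_blocks = [matmul(blk, B) for blk in A_blocks]
C_checksum = matmul(A_checksum, B)

# Simulate losing one process and its result block, then recover it.
failed = 2
expected = C_blocks[failed]
C_blocks[failed] = None
recovered = recover_block(C_blocks, C_checksum, weights, failed)

max_error = max(abs(a - b)
                for row_a, row_b in zip(recovered, expected)
                for a, b in zip(row_a, row_b))
```

Recovery here is exact only up to floating-point rounding, which is why the choice of weights matters numerically; tolerating several simultaneous failures would need several independent encoded blocks rather than the single one assumed in this sketch.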

∗This research was supported in part by the Los Alamos National Laboratory under Contract No. 03891-001-99 49 and the Applied Mathematical Sciences Research Program of the Office of Mathematical, Information, and Computational Sciences, U.S. Department of Energy under contract DE-AC05-00OR22725 with UT-Battelle, LLC. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.