
Graham E Fagg, Edgar Gabriel, Zizhong Chen, Thara Angskun, George Bosilca, Antonin Bukovsky, and Jack Dongarra (ed.) (2003)

Fault Tolerant Communication Library and Applications for High Performance Computing

Proceedings of the Los Alamos Computer Science Institute Symposium, Santa Fe, NM.

With the increasing number of processors in today's machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an important issue for both end users and the institutions running the machines. This paper presents the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which allows applications to recover from a node or link error and continue execution in a well-defined way. The architecture of FT-MPI, an implementation of MPI that uses these semantics, is presented together with benchmark results for various applications. An example of a fault-tolerant parallel equation solver, its performance results, and the time needed to recover from a process failure are also detailed.

