Edgar Gabriel, Graham E Fagg, Antonin Bukovsky, Thara Angskun, and Jack Dongarra (2003)
A Fault-Tolerant Communication Library for Grid Environments
17th Annual ACM International Conference on Supercomputing (ICS 2003), International Workshop on Grid Computing and e-Science.
As the number of processors grows and more applications run in virtual Grid environments, application-level fault tolerance is becoming an increasingly important issue. This paper presents the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which allows applications to recover from a node or link error and continue execution in a well-defined way. It then presents the architecture of FT-MPI, an implementation of MPI based on these semantics, along with tools that support end users during application development with FT-MPI. Finally, it compares the performance of FT-MPI with that of the most relevant MPI libraries on point-to-point benchmarks and the High Performance Linpack benchmark.