LACSI at Rice University

Graham E Fagg, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, and Jack J Dongarra (2005)

Scalable Fault Tolerant MPI: Extending the Recovery Algorithm

In: Proceedings of Recent Advances in Parallel Virtual Machine and Message Passing Interface, Users' Group Meeting Euro PVM/MPI 2005, vol. 3666, pp. 67-75, Springer Heidelberg, Lecture Notes in Computer Science.

Abstract

Fault Tolerant MPI (FT-MPI)[6] was designed to give applications different methods for handling process failures beyond simple checkpoint-restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation, although robust, were very conservative, and this affected scalability on both very large clusters and distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that aim to be both scalable and latency tolerant. Our conclusions show that combining topology-aware collective communication with distributed consensus algorithms produces the best results.

URL http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/11557265_13

by admin — last modified 2007-12-10 21:05

Copyright © 2003-2010 Los Alamos Computer Science Institute (LACSI),
Rice University, MS 41, 6100 Main Street, Houston, TX 77005.

This material is based on work supported by the Department of Energy under Contract
Nos. 03891-001-99-4G, 74837-001-03 49, 86192-001-04 49, and/or 12783-001-05 49
from the Los Alamos National Laboratory.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s)
and do not necessarily reflect the views of the Los Alamos National Laboratory or the Department of Energy.
