Fault Tolerance

by admin — last modified 2004-10-15 12:14

Highly Scalable Fault Tolerance

As supercomputers scale to tens of thousands of nodes, reliability and availability become increasingly critical. Both experimentation and theory have shown that the large component counts in very large-scale systems make hardware faults more likely, especially for long-running jobs. The most popular parallel programming paradigm, MPI, has little support for reliability: when a node fails, all MPI processes are killed, and the user loses all computation since the last checkpoint. In addition, disk-based checkpointing requires high-bandwidth I/O systems to record checkpoints. The collaborative OpenMPI effort promises one possible approach to application-mediated recovery.

To complement this effort, we will build on and expand development of tools for real-time monitoring of system failure indicators (e.g., temperature, soft memory errors and disk retries), tied to Clustermatic and other scalable cluster infrastructures. These tools will include mechanisms to estimate node failure probabilities, as a basis for fault tolerance techniques. We will develop and expand “performability” models that combine both fault-tolerance and performance for systems containing thousands of nodes. These models will include total time to solution as a function of failure modes and probabilities.
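As an illustration of what a "total time to solution" model might look like, the following is a minimal sketch and not the project's actual model. It assumes independent, exponentially distributed node failures, that any single node failure kills the whole job, that checkpoints are taken at a fixed interval, and that restarts themselves are failure-free. The function name and all parameters are hypothetical.

```python
import math


def expected_time_to_solution(work_s, n_nodes, node_mtbf_s,
                              interval_s, ckpt_cost_s, restart_cost_s):
    """Expected wall-clock seconds to finish work_s seconds of useful compute.

    Assumptions (not from the source text): exponential node failures,
    job-wide failure on any node failure, failure-free restart.
    """
    # With independent exponential failures, node rates add up.
    system_rate = n_nodes / node_mtbf_s
    # Each segment is a stretch of useful work followed by one checkpoint.
    segment_s = interval_s + ckpt_cost_s
    # A fractional final segment is treated proportionally (approximation).
    n_segments = work_s / interval_s
    # Expected time to get one segment through, counting lost work and restarts:
    # (1/rate + restart) * (exp(rate * segment) - 1)
    per_segment = (1.0 / system_rate + restart_cost_s) * math.expm1(
        system_rate * segment_s)
    return n_segments * per_segment


if __name__ == "__main__":
    one_year_s = 3.15e7
    for nodes in (100, 1000, 10000):
        t = expected_time_to_solution(36000.0, nodes, one_year_s,
                                      3600.0, 60.0, 120.0)
        print(f"{nodes:6d} nodes: {t / 3600.0:.2f} hours expected")
```

Under these assumptions the model reduces to plain checkpoint overhead when failures are rare, and the expected time grows rapidly once the segment length approaches the system mean time between failures.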

The modeling will be complemented by an experimental harness in which developers of scalable fault-tolerant applications will be able to test their codes by selecting special batch queues with controls for likely failure probabilities and modes. The harness will rely on a set of fault injection tools that can assess the susceptibility of large-scale applications to transient memory, network interface card (NIC) or storage errors.
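To make the idea of injecting a transient memory error concrete, here is a small sketch that flips one randomly chosen bit in a double-precision value. It is only an application-level illustration; real fault injection tools typically work at a lower level, and the function names and the per-element injection probability are hypothetical.

```python
import random
import struct


def flip_random_bit(value, rng):
    """Return value with exactly one of its 64 IEEE-754 bits inverted."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bit = rng.randrange(64)
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def inject_faults(data, probability, rng):
    """Copy data, corrupting each element independently with the given probability."""
    return [flip_random_bit(x, rng) if rng.random() < probability else x
            for x in data]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a test run is repeatable
    clean = [1.0, 2.5, -3.0, 4.25]
    print(inject_faults(clean, 0.5, rng))
```

An application under test could run once on the clean data and once on the corrupted copy, then compare results to see whether the error was detected, masked, or silently propagated.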

Our goal is to estimate node failure probabilities and introduce enough redundancy to enable recovery. This approach complements disk-based checkpointing by recovering from failures that occur between disk checkpoints. We envision it as a low-overhead checkpoint alternative that can be performed much more often than disk checkpointing, triggered either periodically or by system measurements. Finally, we expect this approach to include intelligent learning and adaptation. By monitoring and analyzing failure modes, the system can estimate the adaptation required to achieve a specified reliability. This will enable a smooth balance between performance and reliability.

In addition to these issues, fault tolerance data collection must be scalable and integrated with low overhead performance measurement systems. We will investigate integrated, sample-based measurement schemes that can collect failure and performance data from systems containing thousands or tens of thousands of nodes.

We will also work with the OpenMPI effort to integrate these predictive capabilities with newly developed MPI fault tolerance mechanisms, with extensions to adaptively choose checkpoint frequencies (either disk or memory) based on predicted failure probabilities. Our goal is an adaptive MPI system that can estimate and configure the degree of redundancy and disk or memory checkpointing needed to ensure reliable computation.
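One way such an adaptive choice of checkpoint frequency could work is sketched below, using Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x system MTBF). This is a substituted textbook rule, not the mechanism the project describes, and it is only reasonable when the checkpoint cost is small compared with the MTBF. The function names, the hourly failure probability input, and the example costs are hypothetical.

```python
import math


def system_mtbf_s(node_fail_prob_per_hour, n_nodes):
    """Convert a predicted per-node hourly failure probability to a system MTBF.

    Assumes independent exponential failures, so node rates simply add.
    """
    node_rate = -math.log1p(-node_fail_prob_per_hour) / 3600.0  # failures/s
    return 1.0 / (n_nodes * node_rate)


def young_interval(ckpt_cost_s, mtbf_s):
    """Young's approximation to the compute time between checkpoints."""
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)


def adaptive_interval(node_fail_prob_per_hour, n_nodes, ckpt_cost_s):
    """Recompute the interval whenever the predicted failure probability changes."""
    return young_interval(ckpt_cost_s,
                          system_mtbf_s(node_fail_prob_per_hour, n_nodes))


if __name__ == "__main__":
    for p in (1e-5, 1e-4, 1e-3):
        disk = adaptive_interval(p, 4096, ckpt_cost_s=300.0)   # assumed disk cost
        mem = adaptive_interval(p, 4096, ckpt_cost_s=5.0)      # assumed memory cost
        print(f"p={p:g}: disk every {disk:.0f} s, memory every {mem:.0f} s")
```

In this sketch, a higher predicted failure probability or a cheaper checkpoint (memory rather than disk) both shorten the interval, which matches the intuition that cheap checkpoints can be taken much more often.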

LACSI at Rice University