Personal tools
You are here: Home Publications Assessing Fault Sensitivity in MPI Applications
Document Actions

Charng-da Lu and Daniel A Reed (ed.) (2004)

Assessing Fault Sensitivity in MPI Applications

Proceedings of the 2004 ACM/IEEE conference on Supercomputing.

Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to thousands and with proposed petaflop system likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications. Using software fault injection, we simulated single bit memory errors, register file upsets and MPI message payload corruption and measured the behavioral responses for a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors were often undetected, yielding erroneous output with no user indicators. Encouragingly, even minimal internal application error checking and program assertions can detect many of the faults we injected.

by admin last modified 2007-12-10 21:05
« September 2010 »
Su Mo Tu We Th Fr Sa
1234
567891011
12131415161718
19202122232425
2627282930
 

Powered by Plone

LACSI Collaborators include:

Rice University LANL UH UNM UIUC UNC UTK