
Thara Angskun, Graham E Fagg, George Bosilca, Jelena Pjesivac-Grbovic, and Jack J Dongarra (2006 submitted)

Scalable Fault Tolerant Protocol for Parallel Runtime Environments

In: 2006 Euro PVM/MPI.

The number of processors embedded in high performance computing platforms is growing daily to satisfy users' desire to solve larger and more complex problems. Parallel runtime environments have to support and adapt to the underlying libraries and hardware, which require a high degree of scalability in dynamic environments. This paper presents the design of a scalable and fault tolerant protocol for supporting parallel runtime environment communications. The protocol is designed to support transmission of messages across multiple nodes within a self-healing topology to protect against recursive node and process failures. A formal protocol verification has validated the protocol for both the normal and failure cases. We have implemented multiple routing algorithms for the protocol and concluded that the variant rule-based routing algorithm yields the best overall results for damaged and incomplete topologies.

by admin last modified 2007-12-10 21:05
