Thara Angskun, Graham E Fagg, George Bosilca, Jelena Pjesivac-Grbovic, and Jack J Dongarra (2006)
Self-Healing Network for Scalable Fault Tolerant Runtime Environments
In: DAPSYS 2006 Conference Proceedings, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems, Innsbruck, Austria.
Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware, which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It recovers automatically after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that the latest multicast and broadcast routing algorithms used in SHN are both faster than the original SFTP routing algorithms.