
MPI Messaging


Efficient, Portable, and Scalable Support for MPI Messaging

The goals of this research are to investigate the performance tradeoffs of using TCP over Ethernet in cluster computing and to deploy the results of this work on clusters compatible with those in use at LANL. Specialized networks, such as Quadrics and Myrinet, are typically used in cluster computing because they offer higher bandwidth and lower latency than commodity networks. However, raw Gigabit Ethernet is competitive in terms of bandwidth and latency, and it is especially attractive when cost is considered. The drawbacks of Ethernet typically arise from the way it is used by both the operating system and the MPI library. With specialized networks, protocol processing is usually handled directly in the MPI library. This allows the transport protocol to be tailored to the cluster-computing domain, reducing latency and copying through techniques such as remote DMA. However, these specialized protocols are difficult to develop and improve, and they make it hard to take advantage of many of the networking and event-management features provided by modern operating systems.

In TCP implementations, protocol processing is handled by the operating system, which can be much more efficient than a user-level library. The techniques used in the kernel's network stack are mature and highly optimized, and because they reside in the operating system, every application benefits from their performance enhancements. Furthermore, Ethernet is clearly less expensive than specialized networks, and TCP provides reliability and easy portability across systems. Network servers have achieved extremely high performance with TCP by using scalable event-notification mechanisms (such as /dev/epoll in Linux), zero-copy I/O, and asynchronous I/O. We have shown that implementing the LA-MPI library with an event-driven messaging thread, a well-known technique in the network-server domain, makes TCP over Gigabit Ethernet competitive with Myrinet networks of similar raw bandwidth.
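To make the event-driven approach concrete, the sketch below shows the general pattern as it is used in network servers: a single progress thread registers all peer sockets with an epoll instance and services only the connections the kernel reports as ready. This is an illustration of the technique, not the LA-MPI source; handle_readable() and handle_writable() are hypothetical placeholders for the library's per-connection send and receive state machines.

/* Minimal sketch of an event-driven progress loop over epoll. */
#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_EVENTS 64

extern void handle_readable(int fd);   /* drain incoming MPI fragments (placeholder) */
extern void handle_writable(int fd);   /* push queued outgoing fragments (placeholder) */

void progress_loop(int *socks, int nsocks)
{
    int epfd = epoll_create(MAX_EVENTS);
    if (epfd < 0) { perror("epoll_create"); exit(1); }

    /* Register every peer socket for readiness notification. */
    for (int i = 0; i < nsocks; i++) {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT,
                                  .data.fd = socks[i] };
        epoll_ctl(epfd, EPOLL_CTL_ADD, socks[i], &ev);
    }

    /* One thread blocks here and services whichever sockets are ready,
     * instead of polling every connection in turn. */
    for (;;) {
        struct epoll_event events[MAX_EVENTS];
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].events & EPOLLIN)
                handle_readable(events[i].data.fd);
            if (events[i].events & EPOLLOUT)
                handle_writable(events[i].data.fd);
        }
    }
}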

We will build on this work and show that other general optimizations to TCP, including zero-copy I/O and TCP segmentation offload, further improve the performance of our event-driven Open MPI (previously LA-MPI) library. Memory management within the operating system's network stack can also be a significant bottleneck; we intend to study and improve it to streamline networking performance. These changes combine improvements to the operating system's network stack with changes to the MPI library itself, but most apply to all network communication, not just MPI messaging, making them valuable beyond the supercomputing domain.
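As one example of the kind of copy avoidance discussed above, scatter/gather output with writev() lets the library send an MPI envelope and the user's payload buffer in a single system call, without first copying both into a contiguous staging buffer. This sketch only illustrates the flavor of the optimization; the wire-format header (mpi_hdr) is hypothetical, and the actual stack and library changes under study are not reproduced here.

/* Sketch: gather the header and payload into one writev() call. */
#include <sys/uio.h>
#include <stdint.h>
#include <stddef.h>

struct mpi_hdr {              /* hypothetical wire-format envelope */
    uint32_t tag;
    uint32_t context_id;
    uint64_t length;
};

ssize_t send_fragment(int sock, const struct mpi_hdr *hdr,
                      const void *payload, size_t len)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;       /* envelope */
    iov[0].iov_len  = sizeof(*hdr);
    iov[1].iov_base = (void *)payload;   /* user buffer, sent in place */
    iov[1].iov_len  = len;
    return writev(sock, iov, 2);         /* one syscall, no staging copy */
}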


