MPI Messaging
Efficient, Portable, and Scalable Support for MPI Messaging
The goals of this research are to investigate the performance tradeoffs
of using TCP over Ethernet in cluster computing and to deploy the
results of this work on clusters compatible with those in use at LANL.
Specialized networks, such as Quadrics and Myrinet, are typically used
in cluster computing because they offer higher bandwidth and lower
latency than traditional commodity networks. However, raw Gigabit
Ethernet is competitive in terms of bandwidth and latency, and it is
especially attractive when cost is considered. The drawbacks of
Ethernet typically arise from the way it is used by both the operating
system and the MPI library. With specialized networks, protocol
processing is usually handled directly in the MPI library. This allows
the transport protocol to be tailored to the cluster computing domain,
reducing latency and data copying through techniques such as remote
DMA. However, these specialized protocols are difficult to develop and
maintain, and they make it hard to take advantage of many of the
features that modern operating systems provide for networking and event
management.
In TCP implementations, protocol processing is handled by the operating system, which can be much more efficient than a user-level library. The techniques used in the kernel's network stack are mature and highly optimized, and because they reside in the operating system, every application benefits from their performance enhancements. Furthermore, Ethernet is clearly less expensive than specialized networks, and TCP provides reliability and easy portability across systems. Network servers have achieved extremely high performance with TCP by using scalable event notification systems such as /dev/epoll in Linux, zero-copy I/O, and asynchronous I/O. We have shown that implementing the LA-MPI library with an event-driven messaging thread, a well-known technique in the network server domain, can make TCP over Gigabit Ethernet competitive with Myrinet networks of similar raw bandwidth.
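The core of this event-driven approach can be sketched with the Linux epoll interface. The fragment below is an illustration under simplifying assumptions only: connection setup, message framing, and the MPI progress logic of the actual LA-MPI/OpenMPI messaging thread are omitted, and handle_readable() and progress_loop() are placeholder names, not library functions.

    /* Minimal sketch of an event-driven TCP messaging loop built on the
     * Linux epoll interface. Connection setup, message framing, and the
     * MPI progress logic of the real messaging thread are omitted;
     * handle_readable() is a placeholder stub, not library code. */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_EVENTS 64

    /* Placeholder: a real messaging thread would drain the socket into
     * per-peer buffers and advance any matching MPI receive requests. */
    static void handle_readable(int fd)
    {
        char buf[4096];
        while (recv(fd, buf, sizeof(buf), MSG_DONTWAIT) > 0)
            ;
    }

    /* Block in one place and service only the sockets that are ready,
     * instead of polling every peer connection on each progress call.
     * (EPOLLOUT would be registered temporarily when a send blocks.) */
    void progress_loop(const int *conn_fds, int nconns)
    {
        int epfd = epoll_create1(0);
        if (epfd < 0) { perror("epoll_create1"); exit(1); }

        for (int i = 0; i < nconns; i++) {
            struct epoll_event ev;
            ev.events = EPOLLIN;
            ev.data.fd = conn_fds[i];
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fds[i], &ev) < 0) {
                perror("epoll_ctl");
                exit(1);
            }
        }

        for (;;) {
            struct epoll_event events[MAX_EVENTS];
            int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < nready; i++)
                handle_readable(events[i].data.fd);
        }
    }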
We will build on this work and show that other general optimizations to TCP, including zero-copy I/O and TCP segmentation offload, will further improve the performance of our event-driven OpenMPI (previously LA-MPI) library. Memory management within the operating system's network stack can also be a significant bottleneck, and we intend to study and improve it to streamline networking performance. These changes span both the operating system's network stack and the implementation of the MPI library itself, but most of them apply to all network communication, not just MPI messaging, making them valuable beyond the supercomputing domain.
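As one concrete example of the copy avoidance referred to above, recent Linux kernels (4.14 and later, well after the original LA-MPI work) expose a zero-copy send path through the MSG_ZEROCOPY flag. The sketch below is illustrative only and is not the library's actual send path; send_zerocopy is a hypothetical helper.

    /* Sketch of a zero-copy send using the Linux MSG_ZEROCOPY facility
     * (kernel 4.14+; this API postdates the original LA-MPI work and is
     * shown only to illustrate copy avoidance, not as the library's
     * actual send path). */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdio.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    /* Hypothetical helper: transmit a large message without copying it
     * into kernel socket buffers. The user pages are pinned and handed
     * to the NIC, so the buffer must not be reused until a completion
     * notification is reaped from the socket error queue with
     * recvmsg(fd, ..., MSG_ERRQUEUE) (not shown here). */
    ssize_t send_zerocopy(int fd, const void *buf, size_t len)
    {
        int one = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
            /* Kernel or socket type does not support it: fall back to a
             * normal copying send. */
            perror("SO_ZEROCOPY");
            return send(fd, buf, len, 0);
        }
        return send(fd, buf, len, MSG_ZEROCOPY);
    }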