Clustermatic Performance Instrumentation
Application performance engineering requires a performance instrumentation and analysis infrastructure that is robust and scalable while providing the analysis capabilities needed by developers of both system and application code. Developers need instrumentation of processor performance both within an application process and across all code running on a node. While processor-level measurement may be sufficient for compute-bound applications, operating system operations play a vital role in the performance of applications that communicate with the outside world, including message passing in parallel applications and any interaction with I/O subsystems. Measuring these costs requires adding instrumentation to the operating system kernel and to the specific device drivers that contribute to them.
It will be crucial that the performance instrumentation infrastructure be scalable and impose low enough overheads that it can be used to measure production runs on large systems.
Performance Counter Profiling
HPCToolkit from Rice runs on current Clustermatic systems by layering itself on top of PAPI from Tennessee. One limitation of this approach is that it exposes the internal performance of an application but does not provide a system-wide view that captures all phenomena relevant to performance. We will investigate adding such pervasive performance monitoring and analysis infrastructure to Clustermatic systems. Our approach will be to begin with the OProfile software and extend and modify it to work on Clustermatic systems. This work is coupled to the activities described in the section “Application and System Performance.”
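To illustrate the kind of per-process counter access that PAPI provides beneath HPCToolkit, the following is a minimal sketch (our own example, not code from either toolkit); the chosen events and the measured loop are assumptions for illustration:

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];
    int i;
    volatile double x = 0.0;

    /* Initialize PAPI and build an event set of two common counters. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return 1;
    }
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */
    PAPI_add_event(eventset, PAPI_TOT_INS);   /* instructions completed */

    PAPI_start(eventset);
    for (i = 0; i < 1000000; i++)             /* region of interest */
        x += i * 0.5;
    PAPI_stop(eventset, counts);

    printf("cycles = %lld, instructions = %lld\n", counts[0], counts[1]);
    return 0;
}

Counters read this way are confined to the calling process, which is precisely the limitation noted above: activity in the kernel, in daemons, or in other processes on the node is invisible, motivating the system-wide OProfile-based approach.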
Fine-grained Monitoring of System Software Costs
General operating system costs are becoming an increasingly significant impediment to ASC application performance on large-scale machines. Recent studies at CCS-3, for example, have shown that operating system effects in the SAGE ASC code can cause up to a 50% performance penalty on large-scale systems such as ASCI-Q. The current solution to this problem is to dedicate approximately 12.5% of ASCI-Q to operating system services in order to mitigate OS interference. While this approach does lessen the problem, it comes at the cost of making a non-trivial portion of the ASCI-Q system unavailable to applications.
To address these issues, UNM is working on novel approaches to measuring operating system and message-passing costs in large-scale systems that are central to ASC's mission. The first part of this research consists of modifying the Linux kernel to monitor and report the operating system costs associated with each network transmission and reception on a per-request basis. By augmenting the Linux kernel with per-request monitoring facilities, we aim to quantify the exact operating system costs that cause operating system interference effects like those measured at CCS-3. These per-request monitoring facilities can then be used to guide later modifications of Linux to increase message-passing performance, and can be integrated with more comprehensive system monitoring facilities.
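As a rough sketch of what per-request kernel instrumentation could look like (our illustration, not the UNM implementation; the probed symbol dev_queue_xmit, the use of the kretprobe facility, and printk-based reporting are all assumptions), a small module can timestamp the entry and exit of a transmit-path function to capture the operating system cost of each send:

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ktime.h>

/* Per-request scratch space: the timestamp taken at function entry. */
struct xmit_data {
    ktime_t entry;
};

static int xmit_entry(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    struct xmit_data *d = (struct xmit_data *)ri->data;
    d->entry = ktime_get();
    return 0;
}

static int xmit_return(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    struct xmit_data *d = (struct xmit_data *)ri->data;
    s64 ns = ktime_to_ns(ktime_sub(ktime_get(), d->entry));

    /* One sample per transmission: the OS cost of this request. */
    printk(KERN_INFO "dev_queue_xmit: %lld ns\n", ns);
    return 0;
}

static struct kretprobe xmit_probe = {
    .kp.symbol_name = "dev_queue_xmit",
    .entry_handler  = xmit_entry,
    .handler        = xmit_return,
    .data_size      = sizeof(struct xmit_data),
    .maxactive      = 64,
};

static int __init xmit_mon_init(void)
{
    return register_kretprobe(&xmit_probe);
}

static void __exit xmit_mon_exit(void)
{
    unregister_kretprobe(&xmit_probe);
}

module_init(xmit_mon_init);
module_exit(xmit_mon_exit);
MODULE_LICENSE("GPL");

In a real monitoring facility the per-request samples would be aggregated and exported rather than logged, but the entry/exit timing pattern is the core of per-request cost accounting.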
One such comprehensive system monitoring approach is the focus of UNM’s other LACSI research on monitoring. This approach, which we term message-centric monitoring, extends per-request monitoring to the entire system, measuring the complete hardware, operating system, and communication costs associated with ASC message-passing codes. Instead of examining the performance of individual requests on a host-by-host basis, our message-centric profiling approach associates performance data with an MPI message as it propagates across the system: as it is received by one host, processed by an application, and sent to another host. The overall goal is to have the performance data associated with a message encompass the entire diameter of the computation required to generate it, with the emphasis on profiling message-passing and operating system costs.
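At the application boundary, the standard MPI profiling interface offers a natural attachment point for per-message data. The wrappers below are a minimal sketch under that assumption (they record only host-side call costs; the message-centric approach described above would additionally carry accumulated performance data along with the message itself):

#include <stdio.h>
#include <mpi.h>

/* Link this wrapper library ahead of the MPI library so that these
 * definitions intercept application calls; PMPI_* are the real entry
 * points defined by the MPI profiling interface. */

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);

    /* Record the host-side cost of this send, keyed by (dest, tag). */
    fprintf(stderr, "send to %d, tag %d: %.6f s\n",
            dest, tag, MPI_Wtime() - t0);
    return rc;
}

int MPI_Recv(void *buf, int count, MPI_Datatype type,
             int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    fprintf(stderr, "recv from %d, tag %d: %.6f s\n",
            src, tag, MPI_Wtime() - t0);
    return rc;
}

Extending such wrappers so that the record travels with the message, accumulating kernel and application costs at every hop until it spans the full diameter of the computation, is the substance of the message-centric monitoring research described above.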