Measurement and Analysis
Better Tools for Measurement and Analysis of Application Performance
On terascale systems, performance problems are varied and complex, so a wide range of performance evaluation methods must be supported. The appropriate data collection strategy depends on the aspect of program performance under study. Key strategies for gathering performance data include statistical sampling of program events, inserting instrumentation into the program via source code transformations, link-time rewriting of object code, binary modification before or during execution, and program state modification during execution.
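As a minimal sketch of the source-level instrumentation strategy, the snippet below wraps a function so that call counts and cumulative wall-clock time are recorded in a profile table. The decorator name, the `PROFILE` table, and the `kernel` workload are all hypothetical illustrations, not part of any particular tool:

```python
import functools
import time

# Hypothetical profile table: function name -> (call count, cumulative seconds).
PROFILE = {}

def instrument(fn):
    """Source-level instrumentation: wrap fn to record calls and elapsed time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            count, total = PROFILE.get(fn.__name__, (0, 0.0))
            PROFILE[fn.__name__] = (count + 1, total + time.perf_counter() - start)
    return wrapper

@instrument
def kernel(n):
    # Stand-in for an application routine under study.
    return sum(i * i for i in range(n))

for _ in range(3):
    kernel(1000)

print(PROFILE["kernel"][0])  # 3 calls recorded
```

The same idea applies at the other levels listed above; the difference is only where the wrapper is injected (source, object code, or the running binary) and what it records.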
Capturing traces of program events such as message communication helps characterize the temporal dynamics of application performance; however, the scale of these systems implies that a large volume of performance data must be collected and digested. Improved data collection strategies are needed to capture more useful information while reducing the volume of data that must be gathered. Statistical sampling provides a formal basis for achieving a desired estimation accuracy at a given measurement cost. We will investigate the feasibility of using statistical sampling and population dynamics techniques to characterize performance on large systems. This approach will enable tunable control of measurement accuracy and instrumentation overhead. Concurrently, we will explore applying these techniques to the temporal domain, with a goal of bounding temporal performance trajectories.
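The accuracy/overhead trade-off described above can be illustrated with a simple sketch: instead of logging every trace event, keep each event with probability p and scale the sampled total by 1/p to estimate the population total. The synthetic event stream and the sampling rate here are illustrative assumptions:

```python
import random

random.seed(42)  # deterministic for illustration

# Synthetic trace: per-message byte counts (stand-in for real communication events).
events = [{"bytes": 100 + (i % 7) * 50} for i in range(100_000)]

p = 0.01  # sampling rate: retain roughly 1% of events

# Bernoulli sampling: each event is kept independently with probability p.
sampled = [e for e in events if random.random() < p]

# Horvitz-Thompson-style estimate: scale the sample sum by 1/p.
estimated_bytes = sum(e["bytes"] for e in sampled) / p
true_bytes = sum(e["bytes"] for e in events)

rel_err = abs(estimated_bytes - true_bytes) / true_bytes
print(f"kept {len(sampled)} of {len(events)} events, relative error {rel_err:.1%}")
```

Raising p tightens the estimate at the cost of more data collected and more perturbation of the running program; that is the tunable control the text refers to.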
Research problems to be addressed include determining the appropriate level at which to implement different instrumentation and measurement strategies, supporting a modular and extensible framework for performance evaluation, and striking the appropriate compromise among instrumentation cost, the level of detail of measurements, and the volume of data to be gathered.
Current tools for analysis of application performance on extreme-scale systems suffer from numerous shortcomings. Typically, they provide a myopic view of performance emphasizing descriptive rather than prescriptive data (i.e., what happened rather than guides to improvement), and they do not support effective analysis and presentation of data at extreme scale. To help users cope with the overwhelming volume of information about application behavior on these systems, more sophisticated analysis strategies are needed for automatically identifying and isolating key phenomena of interest, distilling and presenting application performance data in ways that provide insight into performance bottlenecks, and providing application developers with guidance about where and how their programs can be improved.
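One simple instance of "automatically identifying and isolating key phenomena of interest" is flagging processes whose behavior deviates sharply from their peers, so the developer inspects a handful of ranks rather than raw data from all of them. The per-process times and the robust-deviation threshold below are illustrative assumptions:

```python
import statistics

# Synthetic per-process communication times: 64 ranks, one straggler at rank 63.
comm_times = [1.0] * 61 + [1.1, 0.9, 4.2]

# Robust center and scale: median and median absolute deviation (MAD).
median = statistics.median(comm_times)
mad = statistics.median(abs(t - median) for t in comm_times) or 0.05  # fallback scale if MAD is 0

# Isolate ranks whose deviation from the median exceeds a chosen threshold.
outliers = [rank for rank, t in enumerate(comm_times) if abs(t - median) / mad > 6]
print(outliers)  # → [63]: only the straggler is surfaced for inspection
```

Using the median and MAD rather than mean and standard deviation keeps the threshold itself from being dragged around by the very outliers it is meant to find.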
Comparing profiles based on different events, computing derived metrics (e.g., event ratios), and correlating profile data with routines, loops, and statements in application code can provide application developers with insight into performance problems. However, better statistical techniques are needed for analyzing performance data and for understanding the causes and effects of performance differences among processes. Instead of modeling each system component, these techniques select a statistically valid subset of the components and model the members of that subset in detail; properties of the subset then serve as the basis for estimates about the entire system. Our research in this area has so far focused on system availability. We plan to expand that scope and apply these techniques to the study of application performance. The main goal is to evaluate how well application performance can be characterized and understood using a more efficient data collection scheme.
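The subset-based estimation idea above can be sketched as simple random sampling of processes with a normal-approximation confidence interval for a system-wide mean. The metric, process counts, and synthetic timing distribution are hypothetical stand-ins for real measurements:

```python
import math
import random

random.seed(7)  # deterministic for illustration

# Synthetic system: per-process time spent waiting on communication, 4096 ranks.
N = 4096
wait_times = [5.0 + (r % 64) * 0.1 + random.gauss(0, 0.2) for r in range(N)]

# Measure only a statistically valid random subset of the processes.
k = 256
subset = random.sample(wait_times, k)

# Sample mean and a ~95% normal-approximation confidence interval.
mean = sum(subset) / k
var = sum((x - mean) ** 2 for x in subset) / (k - 1)
half_width = 1.96 * math.sqrt(var / k)

true_mean = sum(wait_times) / N  # known here only because the data is synthetic
print(f"estimate {mean:.2f} +/- {half_width:.2f}, true mean {true_mean:.2f}")
```

Only k of the N processes are instrumented in detail, yet the interval quantifies how well the subset characterizes the whole system, which is exactly the accuracy-versus-cost trade the text proposes to study.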