Overview
The motivation for HPCToolkit and a description of our approach are discussed in depth in the papers. Here we outline the key motivations for our work and briefly describe our approach.
Motivation
Our group's primary research interest is in the optimization of programs (both sequential and parallel) for modern high-performance systems. While the goal of the research is to perform these optimizations in a compiler, we also do a lot of optimization and performance analysis by hand. In particular, we found that we were spending too much time and effort in the 'modify, run, analyze' cycle and that existing performance tools were not helping.
The most important problems we had with existing tools were:
- They were not adequately helping us go back and forth between performance data and source code.
- Data was not aggregated at the appropriate level. With optimizing compilers and highly-pipelined, out-of-order processors, line-level (or finer) performance numbers are becoming less and less meaningful. Procedure-level numbers are either too coarse, or they are not meaningful for small procedures unless you turn off aggressive compiler optimizations.
- They did not help us quickly identify problems through top-down analysis.
- We work in a heterogeneous environment with multiple architectures, compilers, and languages. Existing tools weren't flexible enough.
- Existing tools didn't allow us to compute new performance metrics from existing ones. Metrics such as cache miss rates, FLOPS/cycle, and cycles per instruction are very informative, especially when computed at many levels of the program.
- Existing tools did not sufficiently support the tuning cycle. Graphical interfaces may be easy the first time you use them, but not if you have to manually redo everything each time around the cycle.
We developed HPCToolkit to avoid these limitations and to help us gain insight into program performance more easily. The result of our work is a set of tools that work on key platforms for high-performance computing.
Approach
A program called hpcview is at the toolkit's center. It takes all of the processed profiles and, under the direction of a configuration file, produces a browsable database. The hpcview user interface is a browser that allows one to easily go back and forth between data and source code. The `classic' interface is implemented in HTML, so it can be accessed both locally and over the net using commodity browsers. There are panes that list the program's files, display source code, and present tables of performance data that can be sorted by each of the columns. We are in the prototype stages of developing a second, similar interface. The Java-based hpcviewer will be able to take advantage of databases that are much more compact and transferable than the current HTML ones.
The user interface presents performance data in a hierarchical display. At any time, you are looking at some program context (program, file, procedure, loop, or line). The display also shows the data for both the parent and the children of the current context. Up and down arrows on the lines of the display are used to walk the hierarchy. To speed up top-down analysis, the interface also provides `flatten' and `un-flatten' buttons. Their icons hint at their function. `Flatten' modifies the hierarchy by removing non-leaf children of the current node and replacing them with the grandchildren. `Un-flatten' reverses this. Since the tables are sorted, the flatten operation makes short work of diving into the program from the top to identify the most important files, procedures, loops, and statements.
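The flatten operation described above can be sketched in a few lines of Python. This is only an illustration, not hpcview's implementation: the Scope class, its cycles field, and the sample scope names are hypothetical, and the sketch returns a new list instead of modifying the tree, so that un-flatten amounts to showing the original children again.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scope:
    """Hypothetical node in a scope tree (program, file, procedure, loop, or line)."""
    name: str
    cycles: int = 0
    children: List["Scope"] = field(default_factory=list)


def flatten(children):
    """Replace each non-leaf child with its own children; keep leaves as they are.

    The result is sorted by cost, highest first, as the tables in the display are.
    """
    result = []
    for child in children:
        if child.children:
            result.extend(child.children)   # grandchildren take the child's place
        else:
            result.append(child)            # leaves stay where they are
    return sorted(result, key=lambda s: s.cycles, reverse=True)


# Hypothetical program: two files, three procedures, one loop.
program = Scope("program", 100, [
    Scope("a.c", 70, [
        Scope("solve", 60, [Scope("loop@12", 55)]),
        Scope("init", 10),
    ]),
    Scope("b.c", 30, [Scope("report", 30)]),
])

procedures = flatten(program.children)   # files replaced by their procedures
deeper = flatten(procedures)             # solve replaced by its loop

print([s.name for s in procedures])      # ['solve', 'report', 'init']
print([s.name for s in deeper])          # ['loop@12', 'report', 'init']
```

Each application of flatten moves one level down the hierarchy while keeping the costliest scope at the top of the list, which is what makes repeated flattening a quick route from the program level to the hot loops.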
Performance data manipulated by hpcview can come from any source, as long as the profile data can be translated or saved directly to a standard, profile-like input format. To date, the principal sources of input data for hpcview have been hardware performance counter profiles. Such profiles are generated by setting up a hardware counter to monitor events of interest (e.g., primary cache misses), to generate a trap when the counter overflows, and then to histogram the program counter values at which these traps occur. SGI's ssrun along with Compaq's uprofile and dcpi utilities collect profiles this way on MIPS and Alpha platforms, respectively. papirun is a tool that can be used on Linux to collect profiles by sampling hardware performance counters. This tool uses UTK's PAPI library for access to hardware performance counters. papiprof is used to map profiles collected using papirun back to program source lines. papiprof is based on code from Curt Janssen's cprof/vprof profiler.
hpcview and hpcviewer can be used to view profile-like data of any type, not just data sampled from hardware performance counters. For instance, we built a Perl script that would examine assembly code generated by the SGI compilers for MIPS+Irix and create profiles that map register spills back to source code lines.
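The counter-overflow sampling scheme described above can be imitated in software to show how the histogram arises. The following is a toy simulation under our own assumptions, not the way ssrun, uprofile, dcpi, or papirun are implemented: the function name sample_profile, the trace format of (program counter, event count) pairs, and the sample addresses are all hypothetical.

```python
from collections import Counter


def sample_profile(trace, period):
    """Histogram the program counter values at which a simulated counter overflows.

    trace  -- iterable of (pc, events) pairs: the instruction address and the
              number of monitored events (e.g., cache misses) it caused
    period -- number of events between overflows (the sampling period)
    """
    histogram = Counter()
    counter = 0
    for pc, events in trace:
        counter += events
        while counter >= period:      # overflow: the "trap" fires at this pc
            histogram[pc] += 1
            counter -= period         # re-arm the counter
    return histogram


# Hypothetical trace: two instructions, each causing two events per execution.
trace = [(0x400, 2), (0x408, 2)] * 3
profile = sample_profile(trace, period=3)

for pc, samples in sorted(profile.items()):
    print(hex(pc), samples)
```

With twelve events in the trace and a period of three, the simulation records four samples, two at each address. Each sample stands for roughly one period's worth of events, so multiplying the histogram by the period gives an estimate of the event count attributable to each address.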
In addition to measured performance metrics, hpcview allows the user to define expressions to compute derived metrics as functions of the measured data and of previously-computed derived metrics.
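As an illustration of derived metrics, the sketch below computes cycles per instruction, its inverse, and a cache miss rate from measured counts. It is a minimal sketch under our own assumptions and does not reproduce hpcview's configuration syntax: the metric names, the add_derived function, and the sample numbers are hypothetical.

```python
def ratio(numerator, denominator):
    """Divide, returning NaN for scopes where the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


# Derived metrics in definition order; a later entry may use an earlier one.
DERIVED = [
    ("cpi", lambda m: ratio(m["cycles"], m["instructions"])),
    ("ipc", lambda m: ratio(1.0, m["cpi"])),            # built on a derived metric
    ("l1_miss_rate", lambda m: ratio(m["l1_misses"], m["loads"])),
]


def add_derived(measured, definitions=DERIVED):
    """Return a copy of one scope's metrics extended with the derived ones."""
    metrics = dict(measured)
    for name, expression in definitions:
        metrics[name] = expression(metrics)
    return metrics


# Hypothetical measured counts for one loop.
loop = {"cycles": 1000.0, "instructions": 500.0, "l1_misses": 50.0, "loads": 200.0}
print(add_derived(loop))
```

Because the same expressions can be applied to the metrics of any scope, the derived values are available at every level of the hierarchy, from individual lines up to the whole program.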
To facilitate automation, the programs in the HPCToolkit are intended to be run using scripts and configuration files. Once these are set up, rerunning the program to collect new data and all of the steps that go into generating a browsable dataset can be completely automated. The scripts automate the collection of data and the conversion of profile data into a common, XML-based format.
Profilers, e.g., prof -lines, will report results at the line, procedure, and program level, none of which is well suited to performance tuning. To enable aggregation at other levels, a program called bloop extracts the hierarchical structure of the program and of libraries from binaries (unstripped, compiled with -g or -g3). Because it works on binaries, this process is independent of the language used (though in practice it can be somewhat compiler dependent).
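Once the program structure is known, line-level counts can be rolled up to any enclosing scope. The sketch below shows the idea under simplifying assumptions of our own: each scope is described by a range of source lines in a single file, which is not bloop's actual output format, and the aggregate function, scope names, and counts are hypothetical.

```python
def aggregate(line_counts, scopes):
    """Sum line-level counts into every scope whose line range contains the line.

    line_counts -- dict mapping source line number to a sample count
    scopes      -- list of (name, first_line, last_line); scopes may nest
    """
    totals = {name: 0 for name, _, _ in scopes}
    for line, count in line_counts.items():
        for name, first, last in scopes:
            if first <= line <= last:
                totals[name] += count
    return totals


# Hypothetical structure: a procedure containing a loop nest.
scopes = [
    ("procedure solve", 10, 40),
    ("loop at line 12", 12, 30),
    ("loop at line 15", 15, 20),
]
line_counts = {11: 5, 13: 10, 16: 70, 17: 10, 35: 5}

for name, total in aggregate(line_counts, scopes).items():
    print(name, total)
```

In this example the inner loop accounts for 80 of the procedure's 100 samples, a fact that neither the individual line counts nor the procedure total shows directly.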