Retargetable High-Performance Components and Libraries
For many years, retargeting applications to new architectures has been a major headache for high-performance computing. As new architectures have emerged at dizzying speed, we have moved from uniprocessors to vector machines, symmetric multiprocessors, synchronous parallel arrays, distributed-memory parallel computers, and scalable clusters. Each new architecture, and even each new model of a given architecture, has required retargeting and retuning every application, often at a cost of many person-months or person-years of effort.
Unfortunately, we have not yet been able to harness the power of high-performance computing itself to assist in this effort. We propose to change that by embarking on a project to use advanced compilation strategies along with extensive amounts of computing to accelerate the process of moving an application to a new high-performance architecture.
To address the problem of application retargeting, we must exploit some emerging ideas and develop several new technologies.
Automatically Tuned Library Kernels: First, we will exploit recent work on automatically tuning computations for new machines of a given class; FFTW, ATLAS, and UHFFT are examples of the effective use of this approach. The basic idea is to structure the computation to expose a variety of parameterized degrees of freedom, including the degree of parallelism and the cache block size. An automatically generated set of experiments then picks the best parameters for a given new machine. This approach has been extremely successful in producing tuned versions of the Basic Linear Algebra Subprograms (BLAS) that LAPACK requires, allowing that linear algebra package to be ported to new systems. We will extend this work to systems that automatically generate the tuning search space for new libraries, using the automatic application tuning methodologies described in the section “Application and System Performance”.
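To make the experiment-driven search concrete, here is a minimal sketch in Python (the names blocked_matmul and autotune_block_size are ours, and this is not the actual FFTW, ATLAS, or UHFFT machinery): it times a blocked matrix multiply for several candidate cache block sizes and keeps the fastest, which is the essence of the install-time parameter search these systems perform.

```python
import time
import numpy as np

def blocked_matmul(A, B, bs):
    """Multiply A @ B using bs-by-bs blocks (cache blocking, illustrative only)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # NumPy slicing clips at the matrix edge, so n need not divide by bs.
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

def autotune_block_size(n=512, candidates=(16, 32, 64, 128)):
    """Run one timing experiment per candidate block size and keep the fastest.
    A real autotuner would generate the candidate set automatically."""
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    best_bs, best_time = None, float("inf")
    for bs in candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, bs)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_bs, best_time = bs, elapsed
    return best_bs

if __name__ == "__main__":
    print("best block size on this machine:", autotune_block_size())
```

The winning parameter would then be frozen into the installed library, an instance of the static adaptivity discussed below.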
Self-Adapting Numerical Software: We will explore new approaches to building adaptive numerical software that overcomes many of the deficiencies of current libraries. An adaptive software architecture has roughly three layers. First, there is a layer of algorithmic decision making: the top level of an adaptive system concerns itself with the user data and, based on inspection of it, picks the most suitable algorithm or parameterization of that algorithm. The component responsible for this decision is an “Intelligent Agent” that probes the user data and, based on heuristics, chooses among the available algorithms. Second, there is the system layer: software at this level queries the state of the parallel resources and decides on a parallel layout based on the information returned. There can be some dialog between this layer and the algorithmic layer, since the amount of available parallelism can influence algorithmic details. Finally, there is the optimized-libraries layer: here we have kernels that provide optimal realizations of computational and communication operations. Details pertaining to the nature of the user data are unlikely to reach this level.

Implicit in this approach is a distinction among several kinds of adaptivity. First, there is static adaptivity, where adaptation happens during a one-time installation phase. Contrasting with it is dynamic adaptivity, where the nature of the problem and the environment are taken into account at run time. Orthogonal to this dichotomy is the distinction between adapting to the user data and adapting to the computational platform (e.g., memory hierarchy, communication latency and bandwidth, or failure modes). We stress the obvious point that, in order to adapt to user data, a software system must include components that discover properties of the input. Often such discovery can only be done approximately, based on heuristics rather than an exact determination of numerical properties. A minimal sketch of this layered dispatch appears below.
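As a concrete illustration of the three layers, the following Python sketch (all names, e.g., adaptive_solve and is_probably_spd, are hypothetical; this is a sketch of the architecture, not the proposed system) lets a heuristic agent probe the user's matrix and choose between a Cholesky and an LU solve, with a toy system layer standing in for resource discovery.

```python
import os
import numpy as np

# Layer 3: optimized kernels (plain NumPy stands in for tuned libraries).
def solve_cholesky(A, b):
    L = np.linalg.cholesky(A)            # A = L L^T for symmetric positive definite A
    y = np.linalg.solve(L, b)            # forward substitution (generic solve as stand-in)
    return np.linalg.solve(L.T, y)       # back substitution

def solve_lu(A, b):
    return np.linalg.solve(A, b)         # general LU-based solve

# Layer 2: system layer -- query the state of the available resources (toy version).
def choose_layout():
    return {"workers": os.cpu_count() or 1}

# Layer 1: the "Intelligent Agent" -- heuristic discovery of input properties.
def is_probably_spd(A, samples=8):
    """Approximate check: symmetric, and x'Ax > 0 along a few random directions.
    As the text notes, such discovery is heuristic, not an exact determination."""
    if not np.allclose(A, A.T):
        return False
    rng = np.random.default_rng(0)
    for _ in range(samples):
        x = rng.standard_normal(A.shape[0])
        if x @ A @ x <= 0:
            return False
    return True

def adaptive_solve(A, b):
    layout = choose_layout()             # would inform a parallel data layout
    if is_probably_spd(A):
        try:
            return solve_cholesky(A, b)  # cheaper factorization when A is truly SPD
        except np.linalg.LinAlgError:
            pass                         # heuristic was wrong; fall through to LU
    return solve_lu(A, b)
```

In the terminology above, this dispatch is dynamic adaptivity with respect to the user data; the block-size search sketched earlier is static adaptivity with respect to the platform.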
Using the above framework, we will investigate the use of MATLAB as a front end for computing on a cluster.
We propose to conduct research on the topics described in this section and to use the results of this effort to construct at least one retargetable application of interest to DOE and the ASC program.