

by admin last modified 2007-12-10 21:58

The systems activity focuses on research and advanced development of computer subsystems, both hardware and software, of strategic interest to present and future ASC architectures.

The Systems sub-area encompasses research in operating systems and closely allied areas as applied to high performance computing at LANL, specifically within the ASC program.  We focus on research problems that will be critical to the program in a multi-year window beginning in FY05.  In addition to the needs of ASC, the scope of this discussion is further constrained by the interests and abilities of the researchers, research and development programs funded by other sources at the participating institutions, and the LACSI funding level for the work.
Research issues in Systems are organized into two main areas. First, “networking/messaging” refers to problems specifically related to communication research, spanning low-level network architecture to high-level messaging and parallel I/O.  Second, “clustering” encompasses research in software for effective integration of nodes, communication, storage and tools into scalable, high-performance systems.

By the end of FY05, we expect to see dramatic improvements in the raw capabilities of networking hardware, and these improvements will become available in commodity products in the succeeding years.  Initially, the commercial and industrial emphasis will be on the use of this hardware in network infrastructures (backbones) and in commercial servers.  Our challenge is to integrate these technologies into system area networks in new generations of clusters for scientific computing.  Software layers must evolve to leverage the new hardware: to realize better network performance with lower system overheads, to maintain and enhance the reliability of message passing, and to implement new communication standards that make systems more useful.
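The system overheads mentioned above are conventionally quantified with a ping-pong microbenchmark: two endpoints bounce a fixed-size message back and forth, and half the mean round-trip time estimates one-way messaging latency, including all software-layer costs. The following is a minimal, self-contained sketch in Python over a local socket pair; the message size, round count, and function names are illustrative choices, not part of Clustermatic or any LACSI software.

```python
import socket
import threading
import time

MSG_SIZE = 4096   # illustrative message size; real benchmarks sweep many sizes
ROUNDS = 100      # illustrative round count

def echo_server(conn):
    """Receive each fixed-size message in full and echo it back."""
    for _ in range(ROUNDS):
        data = b""
        while len(data) < MSG_SIZE:
            chunk = conn.recv(MSG_SIZE - len(data))
            if not chunk:
                return
            data += chunk
        conn.sendall(data)

def pingpong_latency():
    """Estimate one-way latency for MSG_SIZE-byte messages over a socket pair."""
    a, b = socket.socketpair()
    t = threading.Thread(target=echo_server, args=(b,))
    t.start()
    payload = bytes(MSG_SIZE)
    start = time.perf_counter()
    for _ in range(ROUNDS):
        a.sendall(payload)
        got = b""
        while len(got) < MSG_SIZE:
            got += a.recv(MSG_SIZE - len(got))
    elapsed = time.perf_counter() - start
    t.join()
    a.close()
    b.close()
    # Each round is one full round trip, so halve for a one-way estimate.
    return elapsed / ROUNDS / 2

print(f"estimated one-way latency: {pingpong_latency() * 1e6:.1f} us")
```

On a real cluster interconnect the same measurement would be taken between nodes (e.g. with an MPI ping-pong), where the gap between raw hardware latency and measured latency exposes exactly the software-layer overhead this paragraph describes.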

Cluster technology, whether vendor-integrated systems, user-built Beowulfs, or ad hoc aggregations of workstations, has had a huge impact on parallel computing.  Because clusters are effective on many (though not all) high-end applications, they have become the backbone that provides capacity computing to LANL, DOE, and the nation.
In recent years, however, it has become apparent that we need a new generation of clusters to improve productivity.  Conventional clusters are labor-intensive to set up, administer, maintain, and upgrade; in many organizations much of the expense of these activities is invisible because it is spread across staff other than designated system administrators.  Better approaches to system integration and system software are needed.  Improved efficiency and manageability will improve the economics of small to moderate scale systems for capacity computing, but they are absolutely necessary in order to build and run scalable capability systems.

A promising approach to dealing with this issue is the single system image (SSI) model of clusters.  Initially under LACSI support, and later with funding from the DOE Office of Science, the Cluster Research Lab in CCS-1 pioneered the Clustermatic SSI software package.  While Clustermatic has matured enough to be useful in production systems, a considerable amount of work remains.  This work on the next generation of Clustermatic spans a spectrum from speculative research to “nuts-and-bolts” development work.

Because of the breadth and scale of “next generation Clustermatic”, the academic partners are committed to being engaged in this effort.  It is therefore important that Clustermatic testbed systems be placed at each of the academic institutions to expose the academic community to the issues (research, development, and operational) of building and using SSI systems.  In addition, placing systems at each of the academic institutions will ensure that software efforts are consistent with mainstream Clustermatic development.  Rice, the University of New Mexico, and the University of North Carolina acquired such testbeds during FY04.

