Celso L Mendes and Daniel A Reed (2004)
Monitoring Large Systems via Statistical Sampling
The International Journal of High Performance Computing Applications, 18(2), pp. 267-277.
As parallel systems scale toward petaflop performance, driven by advances in circuit density and by an increasingly available computational Grid, the development of efficient mechanisms for monitoring large systems becomes imperative. When computational components are coupled via dynamically shifting connections with various remote resources, the number of potential factors affecting system behavior is enormous, yet the overhead of monitoring all of them can be prohibitive. In this paper we present a new technique for monitoring large systems based on statistical sampling. Rather than monitoring each component, we select a statistically valid sample and measure the behavior of the sample members. We describe the formal requirements of sample selection and verify the feasibility of our approach with experiments on large parallel systems and wide-area networks. Our results show that this technique can be a powerful tool for effective monitoring without incurring the large costs typically associated with exhaustive checking.
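
The idea of measuring only a statistically valid sample can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the quantity of interest is a proportion (for example, the fraction of nodes that are healthy) and uses the standard textbook sample-size formula with a finite population correction. The function names `sample_size` and `estimate_proportion`, the `is_healthy` probe, and the default 95% confidence level (`z=1.96`) are all hypothetical choices made for illustration; the paper's own formal requirements for sample selection may differ.

```python
import math
import random


def sample_size(population, margin, z=1.96, p=0.5):
    """Number of components to probe so that an estimated proportion is
    within +/- margin of the true value at the confidence level implied
    by z (assumption: simple random sampling without replacement).

    p=0.5 is the worst case and gives the largest, safest sample.
    """
    # Sample size for an effectively infinite population.
    n0 = (z * z * p * (1.0 - p)) / (margin * margin)
    # Finite population correction: small systems need fewer probes.
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return min(population, math.ceil(n))


def estimate_proportion(nodes, is_healthy, margin, z=1.96, rng=None):
    """Probe a random subset of nodes instead of all of them.

    is_healthy is a hypothetical callback standing in for whatever
    measurement the monitor performs on one component.
    Returns (estimated proportion, number of nodes probed).
    """
    rng = rng or random.Random()
    n = sample_size(len(nodes), margin, z)
    sample = rng.sample(nodes, n)  # sampling without replacement
    healthy = sum(1 for node in sample if is_healthy(node))
    return healthy / n, n


if __name__ == "__main__":
    nodes = list(range(10000))
    # Synthetic stand-in for a real probe: every tenth node is "down".
    estimate, probed = estimate_proportion(
        nodes, lambda node: node % 10 != 0, margin=0.05,
        rng=random.Random(0),
    )
    print(f"probed {probed} of {len(nodes)} nodes, "
          f"estimated healthy fraction {estimate:.3f}")
```

Under these assumptions, a system of 10,000 nodes needs only 370 probes for a margin of error of 0.05 at 95% confidence, which is the kind of cost reduction the abstract describes relative to exhaustive checking.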