System Resilience at Extreme Scale White Paper
Authors
Professor Ricardo Bianchini, Rutgers University, Piscataway
Professor Tarek El-Ghazawi, George Washington University, Washington D.C.
Professor Armando Fox, University of California, Berkeley
Forest Godfrey, Cray, Minneapolis
Dr. Adolfy Hoisie, Los Alamos National Laboratory, Los Alamos
Professor Kathryn McKinley, University of Texas, Austin
Professor Rami Melhem, University of Pittsburgh, Pittsburgh
Professor James Plank, University of Tennessee, Knoxville
Dr. Partha Ranganathan, HP Labs, Palo Alto
Josh Simons, Sun Microsystems, Cambridge
Similar resources
Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale
Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution sp...
Inter-Agency Workshop on HPC Resilience at Extreme Scale
The following report summarizes the proceedings of a three-and-a-half-day inter-agency workshop focused on the technical challenges of HPC resilience in the 2020 Exascale timeframe. The resilience problem is not specific to any particular program or agency; coordinated resilience solutions will be challenging because of the need for a truly integrated approach. The inter-agency workshop therefor...
Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)
Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to ca...
Energy profile of rollback-recovery strategies in high performance computing
Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in the supercomputers plays a fundamental role in these challenges. First, a...
On the Definition of Cyber-Physical Resilience in Power Systems
Modern society relies heavily upon complex and widespread electric grids. In recent years, advanced sensors, intelligent automation, communication networks, and information technologies (IT) have been integrated into the electric grid to enhance its performance and efficiency. Integrating these new technologies has resulted in more interconnections and interdependencies between the physical and...