Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers
Authors
Abstract
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also develop an analytical model of the performance for fault tolerance based on periodic checkpointing and compare this approach to both failure avoidance techniques. We find that this comparison is sensitive to the nature of the stochastic distribution of the time between failures, and that failure avoidance is likely inferior to fault tolerance in the long term. Regardless, our results show that each approach is likely to achieve poor utilization for large-scale platforms (e.g., 2 nodes) unless the mean time between failures is large. We show how bounding parallel job size improves utilization, but conclude that achieving good utilization in future large-scale platforms will require a combination of techniques.
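The sensitivity of utilization to platform size can be illustrated with a rough back-of-the-envelope sketch. This is not the analytical model developed in the paper, which the abstract does not reproduce. It assumes independent node failures with exponentially distributed inter-arrival times (so the platform MTBF is the node MTBF divided by the node count) and uses Young's well-known first-order approximation for the periodic checkpointing interval. The function names, the 10-year node MTBF, and the 600-second checkpoint cost are hypothetical values chosen only for illustration.

```python
import math


def platform_mtbf(node_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, ASSUMING independent, exponentially distributed node failures."""
    return node_mtbf_s / num_nodes


def young_period(checkpoint_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpointing period: T = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)


def approx_utilization(checkpoint_s: float, mtbf_s: float) -> float:
    """Rough utilization when checkpointing at Young's period.

    First-order waste is C/T + T/(2M), which equals sqrt(2C/M) at T = sqrt(2CM).
    Recovery and downtime costs are ignored, and the approximation is only
    meaningful when C is much smaller than M, so the result is clamped at 0.
    """
    waste = math.sqrt(2.0 * checkpoint_s / mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    node_mtbf = 10 * 365 * 24 * 3600.0  # hypothetical: 10 years per node, in seconds
    checkpoint = 600.0                  # hypothetical: 10-minute checkpoint
    for exponent in (10, 16, 20):
        m = platform_mtbf(node_mtbf, 2 ** exponent)
        u = approx_utilization(checkpoint, m)
        print(f"2^{exponent} nodes: platform MTBF = {m:.0f} s, utilization ~ {u:.2f}")
```

Under these assumptions, utilization falls as the node count grows because the platform MTBF shrinks in proportion to the number of nodes, which is consistent with the abstract's observation that bounding parallel job size improves utilization.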
Similar resources
Contributing factors to extreme tendencies to internet in students of Shahrekord University of Medical Sciences and providing preventive strategies to deal with it
Background and aims: Internet seems to be increasingly involved a major part of the daily lives of population. In recent years, many reports have confirmed the huge number of internet users worldwide. This article is seeking to explore the factors contributing to the tendency to internet in students of Shahrekord University of Medical Sciences (SKUMS) and aimed to recommend some preventive stra...
Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing
In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger offering improved computational science throughput. Nevertheless, with an increase in the number of systems’ components and their interactions, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues tha...
Comments on “Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing”
In this short note, we provide some comments on the recent paper “Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing” by Bouguerra et al., published in [3]. We start by identifying some errors in their equations. Then we explain that they do not actually use the distribution of lead times, contrary to statements by the authors. Finall...
Energy profile of rollback-recovery strategies in high performance computing
Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in the supercomputers plays a fundamental role in these challenges. First, a...
Checkpointing vs. Migration for Post-Petascale Machines
We craft a few scenarios for the execution of sequential and parallel jobs on future generation machines. Checkpointing or migration, which technique to choose?
Journal title:
- Parallel Processing Letters
Volume 21, Issue
Pages -
Publication date 2011