Using replication for resilience on exascale systems
Authors
Abstract
High performance computing applications must be tolerant to faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-rollback, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should checkpoints be saved? Unfortunately, even with an optimal checkpointing strategy, the frequency of checkpointing must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency on large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-rollback. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily mean application failure. While at first glance replication may seem wasteful, combining it with checkpoint-rollback may be significantly more efficient than using checkpoint-rollback alone at large scale. In this work we investigate two approaches to replication. In the first approach, each process in a single instance of a parallel application is (transparently) replicated. In the second approach, entire application instances are replicated. We provide a theoretical study of these two approaches, comparing them to checkpoint-rollback only, in terms of expected application execution time.
Key-words: Fault-tolerance, scheduling, checkpoint, resilience, exascale, replication
Affiliations: LIRMM, Montpellier, France; Univ. of Hawai‘i at Manoa, Honolulu, USA; University of Tennessee Knoxville, USA; Ecole Normale Supérieure de Lyon, France; INRIA
hal-00650325, version 2 - 15 Dec 2011
French title: La réplication pour la résilience des systèmes exascales
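As a rough illustration of why the checkpointing frequency must grow with platform scale, the sketch below uses Young's classical first-order approximation of the checkpointing period, which is not necessarily the model analyzed in the paper. It assumes independent, exponentially distributed processor failures, so that the platform MTBF is the individual MTBF divided by the number of processors. The function names, the 10-year individual MTBF, and the 600-second checkpoint cost are hypothetical values chosen for illustration only.

```python
import math


def young_period(checkpoint_cost: float, platform_mtbf: float) -> float:
    """Young's first-order approximation of the checkpointing period.

    Both arguments are in seconds. This is an approximation that is only
    meaningful when checkpoint_cost is much smaller than platform_mtbf.
    """
    return math.sqrt(2.0 * checkpoint_cost * platform_mtbf)


def waste_fraction(checkpoint_cost: float, platform_mtbf: float) -> float:
    """First-order fraction of time lost when checkpointing at Young's period.

    Two terms: time spent writing checkpoints (C / T) and the expected
    re-execution after a failure (T / 2 lost once per MTBF on average).
    """
    period = young_period(checkpoint_cost, platform_mtbf)
    return checkpoint_cost / period + period / (2.0 * platform_mtbf)


def platform_mtbf(individual_mtbf: float, num_processors: int) -> float:
    """Platform MTBF, assuming independent exponential failures."""
    return individual_mtbf / num_processors


if __name__ == "__main__":
    # Hypothetical parameters, not taken from the paper.
    individual_mtbf = 10 * 365 * 24 * 3600.0  # 10 years, in seconds
    checkpoint_cost = 600.0                   # 10 minutes, in seconds

    for exponent in (10, 14, 18):
        n = 2 ** exponent
        mtbf = platform_mtbf(individual_mtbf, n)
        period = young_period(checkpoint_cost, mtbf)
        waste = waste_fraction(checkpoint_cost, mtbf)
        print(f"N=2^{exponent}: period={period:.0f}s, waste={waste:.2f}")
```

Under these assumptions the period shrinks and the wasted fraction grows as the processor count increases. Once the waste approaches 1 the first-order model no longer applies, which is the regime where mechanisms such as replication become attractive.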
Similar resources
Toward Exascale Resilience: 2014 Update
Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process cras...
Toward Exascale Resilience
Over the past few years resilience has become a major issue for HPC systems, in particular in the perspective of large Petascale systems and future Exascale ones. These systems will typically gather from half a million to several million CPU cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that Exascale system...
Inter-Agency Workshop on HPC Resilience at Extreme Scale
The following report summarizes the proceedings of a three-and-a-half day inter-agency workshop focused on the technical challenges of HPC resilience in the 2020 Exascale timeframe. The resilience problem is not specific to any particular program or agency; coordinated resilience solutions will be challenging because of the need for a truly integrated approach. The interagency workshop therefor...
Performance Impacts with Reliable Parallel File Systems at Exascale Level
The introduction of Exascale storage into production systems will lead to an increase in the number of storage servers needed by parallel file systems. In this scenario, parallel file system designers should move from the current replication configurations to the more space- and energy-efficient erasure-coded configurations between storage servers. Unfortunately, the current trends on energy eff...
Using group replication for resilience on exascale systems
High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft studied research question is that of the o...