coordinated checkpointing

Comprehensive evaluation of aperiodic checkpointing and rejuvenation schemes in operational software system

Journal: Journal of Systems and Software 2010

Hiroyuki Okamura Tadashi Dohi

This paper presents a comprehensive evaluation of aperiodic time-based checkpointing and rejuvenation schemes that maximize the steady-state system availability of an operational software system. We consider two kinds of maintenance policies: checkpointing prior to rejuvenating (CPTR) and rejuvenating prior to checkpointing (RPTC). These schemes are complementary to each other to schedule checkpoi...

Exploring Checkpointing and Closed Nesting in Distributed Transactional Memory

2013

Alexandru Turcu Roberto Palmieri Binoy Ravindran

Checkpointing and closed nesting are mechanisms typically used for implementing partial roll-back in transactional systems. Closed nesting limits the amount of work to redo on an abort by allowing sub-transactions to abort and retry independently from their parents. Checkpointing goes further and allows a transaction to be rolled back to any previous point where a checkpoint was saved. Checkpoi...
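
The rollback-to-any-checkpoint idea described in this abstract can be illustrated with a minimal single-process sketch. It is not the authors' distributed implementation: the class `CheckpointedTransaction` and its methods are hypothetical names chosen for illustration, and the transaction state is assumed to be a plain dictionary.

```python
import copy


class CheckpointedTransaction:
    """Toy transaction (hypothetical API) that can roll back to any saved checkpoint."""

    def __init__(self, state):
        self.state = dict(state)
        self._checkpoints = []  # snapshots of the state, oldest first

    def checkpoint(self):
        """Save a snapshot of the current state and return its identifier."""
        self._checkpoints.append(copy.deepcopy(self.state))
        return len(self._checkpoints) - 1

    def rollback_to(self, cp_id):
        """Restore the snapshot cp_id and discard every later checkpoint."""
        self.state = copy.deepcopy(self._checkpoints[cp_id])
        del self._checkpoints[cp_id + 1:]


tx = CheckpointedTransaction({"x": 0})
tx.state["x"] = 1
cp = tx.checkpoint()   # work up to this point is preserved
tx.state["x"] = 2      # work after the checkpoint ...
tx.state["y"] = 9
tx.rollback_to(cp)     # ... is undone by a partial rollback
```

Only the work done after `cp` has to be redone, which is the partial roll-back property the abstract attributes to checkpointing.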

Parallel Checkpoint/Recovery on Cluster of IA-64 Computers

2004

Youhui Zhang Dongsheng Wang Weimin Zheng

We design and implement ChaRM64, a high-availability parallel run-time system: a Checkpoint-based Rollback Recovery and Migration system for parallel programs running on a cluster of IA-64 computers. First, we discuss our solution for a user-level, single-process checkpoint/recovery library running on IA-64 systems. Based on this library, ChaRM64 is realized, which implements a user-transpare...

Be Kind, Rewind: Checkpoint & Restore Capability for Improving Reliability of Large-Scale Semiconductor Design

2014

Igor Ljubuncic Ravi Giri Andrew Goldis Avikam Rozenfeld

Intel’s chip design runs in a large-scale, globally distributed environment with 600,000 cores. In the current semiconductor market, a combination of factors such as time-to-market pressure, explosive growth in the mobile market segment, and upcoming new markets has led to a significant increase in the demand for and reliability of computing resources. Checkpointing is a capability that c...

Aggregate Memory as an Intermediate Checkpoint Storage Device

2008

Samer Al-Kiswany Matei Ripeanu Sudharshan S. Vazhkudai

Applications that generate bursty I/O load, such as checkpointing, require additional support to perform efficiently on next-generation petascale supercomputers. Tens of thousands of processors, generating terabytes of snapshot data at once at each timestep, can easily overwhelm a storage system. Further, even at the current peak I/O bandwidth rates offered by parallel file system deployments at ...

Static Analysis for Checkpoint Size Reduction in Array-Based Programs

2007

Greg Bronevetsky Radu Rugina

As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level ...
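
The periodic save-and-restart pattern mentioned in this abstract can be sketched at application level as follows. This is a generic illustration, not the static analysis the paper proposes: the helpers `save_checkpoint`, `load_checkpoint`, and `run`, the file name `state.ckpt`, and the `fail_at` parameter that simulates a crash are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


def run(path, total_steps, interval, fail_at=None):
    """Sum the step indices, checkpointing every `interval` steps."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]


ckpt_path = os.path.join(tempfile.mkdtemp(), "state.ckpt")
try:
    run(ckpt_path, total_steps=10, interval=3, fail_at=7)  # crashes mid-run
except RuntimeError:
    pass
result = run(ckpt_path, total_steps=10, interval=3)  # resumes from step 6
```

The second call repeats only the work done since the last checkpoint, and the whole state dictionary is saved each time. Reducing how much of that state must be written is the concern the paper's title points to.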

Exploring reliability of exascale systems through simulations

2013

Dongfang Zhao Da Zhang Ke Wang Ioan Raicu

Exascale computers are predicted to emerge by the end of this decade with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is how to maintain system reliability effectively and efficiently. Checkpointing is the state-of-the-art technique for high-end computing system reliability that has proved to work well for current pet...

Disaster Survival Guide in Petascale Computing: An Algorithmic Approach

2001

Jack J. Dongarra Zizhong Chen George Bosilca Julien Langou

Contents: 1.1 FT-MPI: A fault tolerant MPI implementation; 1.1.1 FT-MPI Overview; 1.1.2 FT-MPI: A Fault Tolerant MPI Implementation; 1.1.3 FT-MPI Usage; 1....

In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

2009

Gang Wang Xiaoguang Liu Ang Li Fan Zhang

Today, the scale of high-performance computing (HPC) systems is much larger than ever, which poses a challenge for their fault tolerance. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI. Most of them are based on on-disk checkpointing. In this pape...

Continuous Checkpointing: Joining the Checkpointing with Virtual Memory Paging

Journal: Softw., Pract. Exper. 1997

Shang-Te Hsu Ruei-Chuan Chang

Checkpointing is a basic mechanism for backward error recovery in fault-tolerant systems. A checkpointed process periodically stops execution and saves its state to files. To reduce file sizes, only data modified between two consecutive checkpoints is saved. However, existing approaches do not consider operating system paging activities, which, if ignored, may double the number of disk...
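
The idea of saving only data modified between consecutive checkpoints can be sketched with explicit dirty-page tracking. This is a simplified user-level model and does not reproduce the paper's integration with virtual memory paging: the class `IncrementalCheckpointer`, the in-memory list of deltas, and the page representation as `bytes` objects are assumptions for illustration.

```python
class IncrementalCheckpointer:
    """Toy page store (hypothetical API) that checkpoints only modified pages."""

    def __init__(self, num_pages):
        self.pages = [b""] * num_pages
        # Every page counts as dirty until the first checkpoint is taken.
        self.dirty = set(range(num_pages))
        self.checkpoints = []  # one {page_index: contents} delta per checkpoint

    def write(self, index, data):
        """Modify a page and remember that it must be saved next time."""
        self.pages[index] = data
        self.dirty.add(index)

    def checkpoint(self):
        """Save only the dirty pages; return how many pages were written."""
        delta = {i: self.pages[i] for i in sorted(self.dirty)}
        self.checkpoints.append(delta)
        self.dirty.clear()
        return len(delta)

    def restore(self):
        """Rebuild the page contents by replaying the deltas in order."""
        pages = [b""] * len(self.pages)
        for delta in self.checkpoints:
            for index, data in delta.items():
                pages[index] = data
        return pages


ckpt = IncrementalCheckpointer(num_pages=4)
first = ckpt.checkpoint()    # full checkpoint: all 4 pages
ckpt.write(2, b"new")
second = ckpt.checkpoint()   # incremental checkpoint: 1 page
restored = ckpt.restore()
```

The second checkpoint writes a single page instead of four, at the cost of replaying every delta during recovery.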
