Checkpointing and Migration of parallel processes based on Message Passing Interface
نویسنده
چکیده
This paper presents a Checkpoint-based Rollback Recovery and Migration System for Message Passing Interface, ChaRM4MPI, for Linux Clusters. Some important fault tolerant mechanisms are designed and implemented in this system, which include coordinated checkpointing protocol, synchronized rollback recovery, process migration, and so on. Owing to ChaRM4MPI, the node transient faults can be recovered automatically, and the permanent fault can also be recovered through checkpoint mirroring and process migration techniques. Moreover, users can migrate MPI processes from one node to another manually for load balance or system maintenance. ChaRM4MPI is a user-transparent implementation and introduces a little running time overhead.
منابع مشابه
Malleable iterative MPI applications
Malleability enables a parallel application’s execution system to split or merge processes modifying granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application’s processes. Malleability empowers process migration by allowing the application’s processes to expand or shrink following the availabi...
متن کاملCoCheck: Checkpointing and Process Migration for MPI
Checkpointing of parallel applications can be used as the core technology to provide process migration. Both, checkpointing and migration, are an important issue for parallel applications on networks of workstations. The CoCheck environment which we present in this paper introduces a new approach to provide checkpointing and migration for parallel applications. In difference to existing systems...
متن کاملAutomatic Parallel Program Checkpointing in Message-Passing Environments
Problem of efficient cluster resources usage is very important, because of high demand for parallel computations. Checkpointing allows to manage cluster computing time more efficiently. In this article parallel programs checkpointing problems are discussed and implementation of automatic parallel checkpointing systems for MPI programs is presented. It is based on simple user-space portable chec...
متن کاملController/Precompiler for Portable Checkpointing
This paper presents CPPC (Controller/Precompiler for Portable Checkpointing), a checkpointing tool designed for heterogeneous clusters and Grid infrastructures through the use of portable protocols, portable checkpoint files and portable code. It works at variable level being user-directed, thus generating small checkpoint files. It allows parallel processes to checkpoint independently, without...
متن کاملSRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems
The ability to produce malleable parallel applications that can be stopped and recon gured during the execution can o er attractive bene ts for both the system and the applications. The recon guration can be in terms of varying the parallelism for the applications, changing the data distributions during the executions or dynamically changing the software components involved in the application e...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2002