Crash Management for Distributed Parallel Systems
Authors
Abstract
With the growing complexity of parallel architectures, the probability of system failures grows as well. One approach to this problem is self-healing, one of organic computing's self-x features. Self-healing in this context means that computer clusters detect and handle failures automatically. This paper presents a self-healing mechanism based on checkpointing that keeps a cluster operative even if some sites or the connections between them fail. The proposed method has been implemented and tested on the Self Distributing Virtual Machine (SDVM).
Related resources
Static Task Allocation in Distributed Systems Using Parallel Genetic Algorithm
Over the past two decades, PC speeds have increased from a few instructions per second to several million instructions per second. The tremendous speed of today's networks as well as the increasing need for high-performance systems has made researchers interested in parallel and distributed computing. The rapid growth of distributed systems has led to a variety of problems. Task allocation is a...
Message and time efficient consensus protocols for synchronous distributed systems
For a synchronous distributed system of n processes with up to t potential and f actual crash failures, where (t < n − 1, f ≤ t), the time lower bound for a protocol to achieve consensus is min(t + 1, f + 2) rounds. Currently, most research in this field focuses on the time efficiency of consensus protocols. This paper proposes consensus protocols for synchronous distributed systems that achieve ...
Network Storage Management in Data Grid Environment
This paper presents the Network Storage Manager (NSM) developed in the Distributed Computing Laboratory at Jackson State University. NSM is designed as a Java-based, high-performance, distributed storage system, which can be utilized in the Grid environment. NSM architecture presents a framework offering parallelism, scalability, crash recovery, and portability for data-intensive distributed ap...
Atomic Broadcast in Asynchronous Crash-Recovery Distributed Systems and Its Use in Quorum-Based Replication
Atomic Broadcast is a fundamental problem of distributed systems: It states that messages must be delivered in the same order to their destination processes. This paper describes a solution to this problem in asynchronous distributed systems in which processes can crash and recover. A Consensus-based solution to the Atomic Broadcast problem has been designed by Chandra and Toueg for asynchronous di...
A Multi Objective Optimization Model for Redundancy Allocation Problems in Series-Parallel Systems with Repairable Components
The main goal of this paper is to propose an optimization model for determining the structure of a series-parallel system. Relative to previous studies of series-parallel systems, the main contribution of this study is to extend the redundancy allocation problem to systems that have repairable components. The considered optimization model has two objectives: maximizing the system mean time t...
Computing Global Functions in Asynchronous Distributed Systems with Perfect Failure Detectors
A Global Data is a vector with one entry per process. Each entry must be filled with an appropriate value provided by the corresponding process. Several distributed computing problems amount to computing a function on a global data. This paper proposes a protocol to solve such problems in the context of asynchronous distributed systems where processes may fail by crashing. The main problem that ...