Crash Management for Distributed Parallel Systems

نویسندگان

Jan Haase

Frank Eschmann

چکیده

With the growing complexity of parallel architectures, the probability of system failures grows, too. One approach to cope with this problem is the self-healing, one of the organic computing’s self-x features. Self-healing in this context means that computer clusters should detect and handle failures automatically. This paper presents a self-healing mechanism based on checkpointing, so that a cluster remains operative even if some sites or the connections between them fail. The proposed method has been implemented and tested on the Self Distributing Virtual Machine (SDVM).

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Static Task Allocation in Distributed Systems Using Parallel Genetic Algorithm

Over the past two decades, PC speeds have increased from a few instructions per second to several million instructions per second. The tremendous speed of today's networks as well as the increasing need for high-performance systems has made researchers interested in parallel and distributed computing. The rapid growth of distributed systems has led to a variety of problems. Task allocation is a...

متن کامل

Message and time efficient consensus protocols for synchronous distributed systems

For a synchronous distributed system of n processes with up to t potential and f actual crash failures, where (t < n − 1, f t), the time lower bound for a protocol to achieve consensus is min(t + 1, f + 2) rounds. Currently, most researches in this field focus on the time efficiency of consensus protocols. This paper proposes consensus protocols for synchronous distributed systems that achieve ...

متن کامل

Network Storage Management in Data Grid Environment

This paper presents the Network Storage Manager (NSM) developed in the Distributed Computing Laboratory at Jackson State University. NSM is designed as a Java-based, high-performance, distributed storage system, which can be utilized in the Grid environment. NSM architecture presents a framework offering parallelism, scalability, crash recovery, and portability for data-intensive distributed ap...

متن کامل

Atomic Broadcast in Asynchronous Crash-Recovery Distributed Systems and Its Use in Quorum-Based Replication

Atomic Broadcast is a fundamental problem of distributed systems: It states that messages must be delivered in the same order to their destination processes. This paper describes a solution to this problem in asynchronous distributed systems in which processes can crash and recover. A Consensus-based solution to Atomic Broadcast problem has been designed by Chandra and Toueg for asynchronous di...

متن کامل

A Multi Objective Optimization Model for Redundancy Allocation Problems in Series-Parallel Systems with Repairable Components

The main goal in this paper is to propose an optimization model for determining the structure of a series-parallel system. Regarding the previous studies in series-parallel systems, the main contribution of this study is to expand the redundancy allocation parallel to systems that have repairable components. The considered optimization model has two objectives: maximizing the system mean time t...

متن کامل

Computing Global Functions in Asynchronous Distributed Systems with Perfect Failure Detectors

ÐA Global Data is a vector with one entry per process. Each entry must be filled with an appropriate value provided by the corresponding process. Several distributed computing problems amount to compute a function on a global data. This paper proposes a protocol to solve such problems in the context of asynchronous distributed systems where processes may fail by crashing. The main problem that ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2004

Crash Management for Distributed Parallel Systems

نویسندگان

چکیده

منابع مشابه

Static Task Allocation in Distributed Systems Using Parallel Genetic Algorithm

Message and time efficient consensus protocols for synchronous distributed systems

Network Storage Management in Data Grid Environment

Atomic Broadcast in Asynchronous Crash-Recovery Distributed Systems and Its Use in Quorum-Based Replication

A Multi Objective Optimization Model for Redundancy Allocation Problems in Series-Parallel Systems with Repairable Components

Computing Global Functions in Asynchronous Distributed Systems with Perfect Failure Detectors

عنوان ژورنال:

اشتراک گذاری