Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer's Perspective

Authors

  • Sebastian Gerlach
  • Basile Schaeli
  • Roger D. Hersch
Abstract

Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications in case of node failures. The fault-tolerance mechanism relies on a set of backup threads stored in the volatile storage of alternate nodes; these are kept up to date both by duplicating transmitted data objects and by periodic checkpointing. The current state of a failed node can be reconstructed on its backup threads by re-executing the application from the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application’s source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer’s perspective.
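
The recovery idea described in the abstract (a backup kept current by duplicated data objects and periodic checkpoints, then brought up to date by re-execution) can be illustrated with a minimal sketch. This is not the DPS API: DPS is a separate framework, and the names `Worker`, `Backup`, and `run` below are hypothetical, chosen only for illustration. The sketch also assumes a single stateful operation and in-order delivery, whereas DPS deduces a valid execution order from the flow graph.

```python
import copy


class Worker:
    """Hypothetical stateful operation: keeps a running sum of data objects."""

    def __init__(self):
        self.state = {"total": 0, "count": 0}

    def process(self, obj):
        self.state["total"] += obj
        self.state["count"] += 1


class Backup:
    """Hypothetical backup held in the memory of an alternate node."""

    def __init__(self):
        self.checkpoint = {"total": 0, "count": 0}
        self.log = []  # data objects duplicated since the last checkpoint

    def duplicate(self, obj):
        # Every data object sent to the worker is also sent to the backup.
        self.log.append(obj)

    def store_checkpoint(self, state):
        # A checkpoint supersedes all objects received before it.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def recover(self):
        # Rebuild the lost state: restore the checkpoint, then re-execute
        # the operation on every object logged since that checkpoint.
        worker = Worker()
        worker.state = copy.deepcopy(self.checkpoint)
        for obj in self.log:
            worker.process(obj)
        return worker


def run(inputs, checkpoint_every, fail_after):
    """Process inputs; simulate a node failure after `fail_after` objects."""
    worker, backup = Worker(), Backup()
    for i, obj in enumerate(inputs, start=1):
        backup.duplicate(obj)
        worker.process(obj)
        if i % checkpoint_every == 0:
            backup.store_checkpoint(worker.state)
        if i == fail_after:
            worker = backup.recover()  # the active worker is considered lost
    return worker.state
```

In this sketch, a run with a simulated failure ends in the same state as a failure-free run, which is the property the checkpoint-plus-replay scheme is meant to provide.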

Similar resources

Application of Thau Observer for Fault Detection of Micro Parallel Plate Capacitor Subjected to Nonlinear Electrostatic Force

This paper investigates the fault detection of a micro parallel plate capacitor subjected to nonlinear electrostatic force. To this end, a Thau observer, which performs well in fault detection for nonlinear systems, is presented together with the governing nonlinear dynamic equation of the capacitor. Upper and lower thresholds for fault detection are obtained. The robustness of th...

Fault-Tolerant Scheduling of Fine-Grained Tasks in Grid Environments

Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the fine-grained tasks of a divide-and-conquer application across multiple clusters in a grid. To accommodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of...

NFSv4 as the Building Block for Fault Tolerant Applications

We propose extensions to the NFSv4 client architecture that provide recovery services, checkpointing and logging to parallel applications. Fault-tolerance relies on NFSv4 clients forming client clusters that share delegation state and transfer data among their local caches. We contend that parallel extensions to NFSv4 should not be limited to file system technologies, such as parallelizing or v...

Non radial model of dynamic DEA with the parallel network structure

In this article, a non-radial method of dynamic DEA with a parallel network structure is presented and used to calculate relative efficiency measures when inputs and outputs do not change equally. In this model, the divisions of the DMU under evaluation are arranged in parallel, while its dynamic structure is assumed to be in series. Since in real applications there are undesirable inputs an...

Fault Tolerant File Models for MPI-IO Parallel File Systems

Parallelism in file systems is obtained by using several independent server nodes supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in a single node can make the whole system fail. To avoid this problem, data must be stored using some kind of redundant technique, so that it can be recovered i...

Publication date: 2006