Design and Implementation of a Fault Tolerant Job Flow Manager Using Job Flow Patterns and Recovery Policies
Abstract
Many grid applications are currently developed as job flows composed of multiple jobs. Executing a job flow requires the support of a job flow manager and a job scheduler. Because job flows are long-running, support for fault tolerance and recovery policies is especially important. This support is inherently complicated by the sequencing and dependency of jobs within a flow, and by the coordination required between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow at deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design separates the job flow logic from the fault handling logic, requires no manipulation at modeling time, and provides flexibility in fault resolution at runtime. We validate our design with a prototype implementation based on the ActiveBPEL workflow engine and the GridWay Metascheduler, with the Montage application as the case study.
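The two-step idea in the abstract (label patterns at deployment time, then let a proxy resolve faults by pattern at runtime) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the pattern labels, the `RecoveryPolicy` fields, the `FaultProxy` class, and the `JobFault` exception are hypothetical names and do not reflect the paper's implementation or the ActiveBPEL or GridWay APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class JobFault(Exception):
    """Hypothetical fault raised when a submitted job fails."""


@dataclass
class RecoveryPolicy:
    max_retries: int       # resubmissions allowed after the first failure
    skip_on_failure: bool  # if True, drop the job instead of failing the flow


# Hypothetical pattern labels mapped to policies; the paper's own
# pattern catalogue and policies may differ.
POLICIES: Dict[str, RecoveryPolicy] = {
    "sequence": RecoveryPolicy(max_retries=2, skip_on_failure=False),
    "parallel-split": RecoveryPolicy(max_retries=1, skip_on_failure=True),
}


class FaultProxy:
    """Sits between the flow logic and the scheduler and handles faults."""

    def __init__(
        self,
        submit: Callable[[str], str],     # stand-in for a scheduler call
        labels: Dict[str, str],           # job id -> pattern label (deployment time)
        policies: Dict[str, RecoveryPolicy],
    ) -> None:
        self.submit = submit
        self.labels = labels
        self.policies = policies

    def run(self, job_id: str) -> Optional[str]:
        policy = self.policies[self.labels[job_id]]
        attempts = 0
        while True:
            attempts += 1
            try:
                return self.submit(job_id)
            except JobFault:
                if attempts <= policy.max_retries:
                    continue          # resubmit the job
                if policy.skip_on_failure:
                    return None       # give up on this job, keep the flow alive
                raise                 # escalate the fault to the flow
```

In this sketch the flow definition never mentions retries or skipping; it only calls `run`, which mirrors the separation of job flow logic and fault handling logic that the abstract describes.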
Related resources
Design of a Fault-tolerant Job-flow Manager for Grid Environments Using Standard Technologies, Job-flow Patterns, and a Transparent Proxy
The execution of job flow applications is a reality today in academic and industrial domains. Current approaches to execution of job flows often follow proprietary solutions on expressing the job flows and do not leverage recurrent job-flow patterns to address faults in Grid computing environments. In this paper, we provide a design solution to development of job-flow managers that uses standar...
BPEL4Job: A Fault-Handling Design for Job Flow Management
Workflow technology is an emerging paradigm for systematic modeling and orchestration of job flow for enterprise and scientific applications. This paper introduces BPEL4Job, a BPEL-based design for fault handling of job flow in a distributed computing environment. The features of the proposed design include: a two-stage approach for job flow modeling that separates base flow structure from faul...
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...
Fault Tolerant Parallel Image Generation on a Workstation Network
Image generation for computer movies is a good candidate application for parallelisation. This application was used as a starting point to design a fault tolerant distributed computing environment aimed to run parallel applications. The paper first describes the context of this work, then it presents the requirements that the environment should meet. The paper then describes the use of the Distri...
A New Design of Fault Tolerant Comparator
In this paper we present a new design of a fault tolerant comparator with a fault-free hot spare. The aim of this design is to achieve low time and area overhead in fault tolerant comparators. We use the hot standby technique to keep the system operating normally without interruption, and a dynamic recovery method for fault detection and correction. The circuit is divided into smaller modules...