Improving Existing Fault Recovery Policies

Authors

  • Guy Shani
  • Christopher Meek
Abstract

An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system to create an improved controller. We suggest learning an indefinite-horizon Partially Observable Markov Decision Process (POMDP), a model for decision making under uncertainty, and solving it using a point-based algorithm. We describe the complete process: data gathering, model learning, model checking procedures, and computing a policy.
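As a rough illustration of the kind of model the abstract refers to, the sketch below shows the standard belief update of a Partially Observable Markov Decision Process on a toy two-state server. It is not the authors' model or algorithm: the state names, the `probe` action, the observation labels, and all probabilities are assumptions made up for this example.

```python
from typing import Dict, Tuple

Belief = Dict[str, float]

def belief_update(
    belief: Belief,
    action: str,
    obs: str,
    trans: Dict[Tuple[str, str], Dict[str, float]],
    obs_model: Dict[Tuple[str, str], Dict[str, float]],
) -> Belief:
    """Standard POMDP belief update:
    b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s).
    """
    unnormalized = {}
    for s_next in belief:
        # Probability of landing in s_next before seeing the observation.
        predicted = sum(
            trans[(s, action)].get(s_next, 0.0) * p for s, p in belief.items()
        )
        # Weight by how likely the observation is in s_next.
        unnormalized[s_next] = obs_model[(s_next, action)].get(obs, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical toy model: probing a server does not change its state.
TRANS = {
    ("healthy", "probe"): {"healthy": 1.0},
    ("faulty", "probe"): {"faulty": 1.0},
}
# Hypothetical noisy monitor: probabilities chosen only for illustration.
OBS = {
    ("healthy", "probe"): {"ok": 0.9, "error": 0.1},
    ("faulty", "probe"): {"ok": 0.2, "error": 0.8},
}

prior = {"healthy": 0.5, "faulty": 0.5}
posterior = belief_update(prior, "probe", "error", TRANS, OBS)
print(posterior)
```

Under these assumed numbers, a single `error` observation moves the belief in `faulty` from 0.5 to 8/9. A recovery policy computed over such beliefs would pick repair actions from this distribution rather than from a single assumed state.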


Similar resources

On Improving the Fault Detection and Reliability in FDDI Networks

FDDI (Fiber Distributed Data Interface) is a 100 Mbps token ring network with two counter-rotating optical rings. Various possible faults (such as lost tokens and link failures) are considered, and fault detection, the ring recovery process after a failure, and the reliability mechanisms provided are studied. We suggest a new method to improve the fault detection and ring recovery process. ...


Performance tuning policies for application level fault tolerance in distributed object systems

In distributed object systems, application level fault tolerance is often attained by appropriate object replication policies. These policies aim at increasing the exhibited service availability by masking potential faults that do not recur after recovery. Existing middleware support infrastructures allow customizing object replication properties. However, since fault tolerance has a significan...


On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File)

Our work focuses on the recovery solutions for XOR-based erasure codes. We point out that regenerating codes [5] have recently been proposed to minimize the recovery bandwidth in distributed storage systems. The idea is that surviving storage nodes compute and transmit linear combinations of their stored data during failure recovery. On the other hand, in XOR-based erasure codes, we do not requi...


Improving Point-Based POMDP Policies at Run-Time

Point-based algorithms have been widely used for computing approximate solutions for POMDPs. While they work well in many cases, they can perform very poorly if the current belief state at run time has not been well sampled. In this paper we proposed several heuristic functions for estimating when offline approximate policies are likely to perform poorly at the current belief point. We show tha...


Design and Implementation of a Fault Tolerant Job Flow Manager Using Job Flow Patterns and Recovery Policies

Currently, many grid applications are developed as job flows that are composed of multiple jobs. The execution of job flows requires the support of a job flow manager and a job scheduler. Due to the long running nature of job flows, the support for fault tolerance and recovery policies is especially important. This support is inherently complicated due to the sequencing and dependency of jobs w...




Publication date: 2009