Batch-Switching Policy Iteration
Abstract
Policy Iteration (PI) is a widely used family of algorithms for computing an optimal policy for a given Markov Decision Problem (MDP). Starting with an arbitrary initial policy, PI repeatedly updates to a dominating policy until an optimal policy is found. The update step involves switching the actions corresponding to a set of “improvable” states, which are easily identified. Whereas progress is guaranteed even if just one improvable state is switched at every step, the canonical variant of PI, attributed to Howard [1960], switches every improvable state in order to obtain the next iterate. For MDPs with n states and 2 actions per state, the tightest known bound on the complexity of Howard’s PI is O(2^n/n) iterations. To date, the tightest bound known across all variants of PI is O(1.7172^n) expected iterations for a randomised variant introduced by Mansour and Singh [1999]. We introduce Batch-Switching Policy Iteration (BSPI), a family of deterministic PI algorithms that switches states in “batches”, taking the batch size b as a parameter. By varying b, BSPI interpolates between Howard’s PI and another previously studied variant called Simple PI [Melekopoglou and Condon, 1994]. Our main contribution is a bound of O(1.6479^n) on the number of iterations taken by an instance of BSPI. We believe this is the tightest bound shown yet for any variant of PI. We also present experimental results that suggest Howard’s PI might itself enjoy an even tighter bound.
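The switching scheme described in the abstract can be illustrated with a minimal Python sketch for 2-action MDPs. The abstract does not spell out how batches are formed or chosen, so the sketch assumes fixed consecutive blocks of b states and, in each iteration, switches the improvable states of the highest-indexed batch that contains any. The names `random_mdp`, `evaluate`, `improvable_states` and `batch_switching_pi`, the discount factor `gamma`, and the tolerance `tol` are hypothetical and not taken from the paper.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]

def evaluate(T, R, gamma, policy):
    """Value function of `policy`: solve (I - gamma * T_pi) V = R_pi."""
    n = len(policy)
    A = [[(1.0 if s == t else 0.0) - gamma * T[s][policy[s]][t]
          for t in range(n)] for s in range(n)]
    return solve(A, [R[s][policy[s]] for s in range(n)])

def improvable_states(T, R, gamma, policy, V, tol=1e-9):
    """States where the other action has a strictly higher one-step lookahead value."""
    n = len(policy)
    found = []
    for s in range(n):
        a = 1 - policy[s]  # the only alternative in a 2-action MDP
        q = R[s][a] + gamma * sum(T[s][a][t] * V[t] for t in range(n))
        if q > V[s] + tol:
            found.append(s)
    return found

def batch_switching_pi(T, R, gamma, b, policy=None):
    """Assumed batching rule: blocks of b consecutive states; switch the
    improvable states of the highest-indexed block that has any."""
    n = len(T)
    policy = list(policy) if policy else [0] * n
    iterations = 0
    while True:
        V = evaluate(T, R, gamma, policy)
        imp = improvable_states(T, R, gamma, policy, V)
        if not imp:
            return policy, V, iterations
        top = max(imp) // b
        for s in imp:
            if s // b == top:
                policy[s] = 1 - policy[s]
        iterations += 1

def random_mdp(n, seed=0):
    """Random 2-action MDP: T[s][a][t] is a transition probability, R[s][a] a reward."""
    rng = random.Random(seed)
    T, R = [], []
    for _ in range(n):
        rows = []
        for _ in range(2):
            w = [rng.random() for _ in range(n)]
            total = sum(w)
            rows.append([x / total for x in w])
        T.append(rows)
        R.append([rng.random(), rng.random()])
    return T, R

if __name__ == "__main__":
    T, R = random_mdp(8, seed=1)
    for b in (1, 2, 4, 8):
        _, _, iters = batch_switching_pi(T, R, 0.9, b)
        print(f"batch size {b}: {iters} iterations")
```

Under these assumptions, b = n switches every improvable state at once, as in Howard’s PI, while b = 1 switches a single improvable state per iteration, in the spirit of Simple PI. Iteration counts on small random MDPs say nothing about the worst-case bounds discussed in the abstract.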
Similar resources
Batch Policy Iteration Algorithms for Continuous Domains
This paper establishes the link between an adaptation of the policy iteration method for Markov decision processes with continuous state and action spaces and the policy gradient method when the differentiation of the mean value is directly done over the policy without parameterization. This approach allows deriving sound and practical batch Reinforcement Learning algorithms for continuous stat...
Analysis of Manufacturing Systems
1. Introduction. We consider polling systems where service is given in batches of unlimited size. When the server visits a queue, all customers present are served in a single batch. We call this gated batch service. The batch service time is independent of the size of the batch. Some examples of such systems are discussed in the literature below. Examples more related to manufacturing are ovens,...
Optimal Control for an M/g/1/n + 1 Queue with Two Service Modes
A finite-buffer queueing model is considered with batch Poisson input and controllable service rate. A batch that upon arrival does not fit in the unoccupied places of the buffer is partially rejected. A decision to change the service mode can be made at service completion epochs only and vacation (switch-over) times are involved to prepare for the new mode. During a switch-over time service is...
Optimization of a tutoring system from a fixed set of data
In this paper, we present a general method for optimizing a tutoring system with a target application in the domain of second language acquisition. More specifically, the optimisation process aims at learning the best sequencing strategy for switching between teaching and evaluation sessions so as to maximise the increase of knowledge of the learner in an adapted manner. The most important feat...
Optimisation d'un tuteur intelligent à partir d'un jeu de données fixé (Optimization of a tutoring system from a fixed set of data) [in French]
In this paper, we present a general method for optimizing a tutoring system with a target application in the domain of second language acquisition. More specifically, the optimisation process aims at learning the best sequencing strategy for switching between teaching and evaluation sessions so as to maximise the increase of knowledge o...