Pseudometrics for State Aggregation in Average Reward Markov Decision Processes
Abstract
We consider how state similarity in average reward Markov decision processes (MDPs) may be described by pseudometrics. Introducing the notion of adequate pseudometrics, which are well adapted to the structure of the MDP, we show how these may be used for state aggregation. We give upper bounds on the loss that may be incurred by working on the aggregated MDP instead of the original one, and compare them to the bounds that have been achieved for discounted reward MDPs.
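To make the aggregation step concrete, the following Python sketch merges states lying within a threshold eps of one another under a given pseudometric and builds the induced aggregated MDP. The greedy clustering, the uniform averaging over cluster members, and all names (aggregate_states, P, r, d, eps) are illustrative assumptions; the paper's adequate pseudometrics and loss bounds are not reproduced here.

    import numpy as np

    def aggregate_states(P, r, d, eps):
        # P: (S, A, S) transition kernel; r: (S, A) rewards;
        # d: (S, S) pseudometric matrix; eps: aggregation threshold.
        S, A, _ = P.shape
        labels = -np.ones(S, dtype=int)
        k = 0
        for s in range(S):
            if labels[s] < 0:
                # s starts a new cluster and absorbs every unassigned
                # state within pseudometric distance eps of it.
                labels[(labels < 0) & (d[s] <= eps)] = k
                k += 1
        # Aggregated rewards and transition probabilities are obtained by
        # averaging uniformly over cluster members (a modeling assumption).
        P_agg = np.zeros((k, A, k))
        r_agg = np.zeros((k, A))
        for c in range(k):
            members = labels == c
            r_agg[c] = r[members].mean(axis=0)
            for c2 in range(k):
                P_agg[c, :, c2] = P[members][:, :, labels == c2].sum(axis=2).mean(axis=0)
        return labels, P_agg, r_agg

Roughly, the paper's upper bounds control how much average reward can be lost by solving such an aggregated MDP instead of the original one, as a function of the aggregation threshold and how well the pseudometric fits the MDP's structure.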
Similar Papers
Adaptive aggregation for reinforcement learning in average reward Markov decision processes
We present an algorithm which aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy. These bounds are only slightly worse than the orig...
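As a rough illustration of confidence-interval-based aggregation, the test below declares two states indistinguishable, and hence candidates for merging, when their empirical reward and transition estimates overlap within their confidence radii. This is a hedged sketch under assumed names (may_aggregate, r_hat, p_hat, conf_r, conf_p); it is not the actual rule of the UCRL-based algorithm.

    import numpy as np

    def may_aggregate(r_hat, p_hat, conf_r, conf_p, s1, s2):
        # r_hat: (S, A) empirical mean rewards; p_hat: (S, A, S) empirical
        # transition probabilities; conf_r, conf_p: (S, A) confidence radii.
        A = r_hat.shape[1]
        for a in range(A):
            # Reward confidence intervals must overlap ...
            if abs(r_hat[s1, a] - r_hat[s2, a]) > conf_r[s1, a] + conf_r[s2, a]:
                return False
            # ... and estimated transition distributions must be close in L1.
            if np.abs(p_hat[s1, a] - p_hat[s2, a]).sum() > conf_p[s1, a] + conf_p[s2, a]:
                return False
        return True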
A State Aggregation Approach to Singularly Perturbed Markov Reward Processes
In this paper, we propose a single-sample-path-based algorithm with state aggregation to optimize the average reward of singularly perturbed Markov reward processes (SPMRPs) with large-scale state spaces. It is assumed that such a reward process depends on a set of parameters. Differing from other kinds of Markov chains, SPMRPs have their own hierarchical structure. Based on this special s...
Semi-Markov Decision Processes
Considered are infinite horizon semi-Markov decision processes (SMDPs) with finite state and action spaces. Total expected discounted reward and long-run average expected reward optimality criteria are reviewed. Solution methodology for each criterion is given; constraints and variance sensitivity are also discussed.
The BisimDist Library: Efficient Computation of Bisimilarity Distances for Markovian Models
This paper presents a library for exactly computing the Kantorovich-based bisimilarity pseudometrics between Markov chains and between Markov decision processes. These are distances that measure the behavioral discrepancies between non-bisimilar systems. They are computed using an on-the-fly greedy strategy that avoids exhaustive state space exploration and does not require a complete ...
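The core primitive behind such pseudometrics is the Kantorovich (optimal transport) distance between the next-state distributions of two states. The sketch below solves it as a standard linear program with SciPy; this is a generic formulation for illustration, not the library's on-the-fly greedy strategy, and the function name kantorovich is assumed.

    import numpy as np
    from scipy.optimize import linprog

    def kantorovich(mu, nu, d):
        # mu, nu: probability vectors over n states; d: (n, n) ground
        # distance. Minimizes sum_ij omega[i, j] * d[i, j] over couplings
        # omega with row marginals mu and column marginals nu.
        n = len(mu)
        c = d.reshape(-1)
        A_eq = np.zeros((2 * n, n * n))
        for i in range(n):
            A_eq[i, i * n:(i + 1) * n] = 1   # row i of omega sums to mu[i]
            A_eq[n + i, i::n] = 1            # column i of omega sums to nu[i]
        b_eq = np.concatenate([mu, nu])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
        return res.fun

Iterating this distance over all state pairs, with d taken as the current distance estimate, until a fixed point is reached yields a bisimilarity pseudometric; the library's contribution is computing this exactly without exploring the full state space.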
Combinations and Mixtures of Optimal Policies in Unichain Markov Decision Processes are Optimal
We show that combinations of optimal (stationary) policies in unichain Markov decision processes are optimal. That is, let M be a unichain Markov decision process with state space S, action space A, and policies π∘_j : S → A (1 ≤ j ≤ n) with optimal average infinite horizon reward. Then any combination π of these policies, where for each state i ∈ S there is a j such that π(i) = π∘_j(i), is opt...
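The claim is easy to probe numerically on small instances: evaluate the gain (average reward) of a state-wise combination of two optimal policies and compare it with the optimal gain. A minimal sketch assuming a finite unichain MDP; the helper gain and its stationary-distribution computation are illustrative.

    import numpy as np

    def gain(P, r, pi):
        # Average reward of stationary deterministic policy pi in a
        # unichain MDP with kernel P: (S, A, S) and rewards r: (S, A).
        S = len(pi)
        P_pi = P[np.arange(S), pi]          # (S, S) chain induced by pi
        r_pi = r[np.arange(S), pi]
        # Stationary distribution: mu @ P_pi = mu, mu sums to 1.
        A_mat = np.vstack([P_pi.T - np.eye(S), np.ones(S)])
        b = np.concatenate([np.zeros(S), [1.0]])
        mu, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
        return mu @ r_pi

    # For optimal policies pi1, pi2 and any boolean mask over states, the
    # result predicts gain(P, r, np.where(mask, pi1, pi2)) == gain(P, r, pi1).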