POMDP Learning using Qualitative Belief Spaces

Author

  • Bruce D’Ambrosio
Abstract

We present Κ-abstraction as a method for automatically generating small discrete belief spaces for partially observable Markov decision problems (POMDPs). This permits direct application of existing reinforcement learning methods to POMDPs. We show results from applying these methods to a 256-state POMDP, and discuss the types of problems for which the method is suitable.

Topic: Algorithms and Architectures

Introduction

Many ongoing problems, such as monitoring and repair of on-line systems, are naturally formulated as partially observable Markov decision problems (POMDPs). Informally, a Markov decision problem (MDP) model includes a state space model; an action model; a transition model showing how the state space evolves in response to actions; a reward model which reflects (typically) the cost of actions and the rewards (or costs) associated with various states; and a performance model which describes how agent performance is to be scored. A partially observable MDP (POMDP) is one in which we assume the agent does not have direct access to the current state, but has only limited (perhaps noisy) evidence (see [Cassandra et al, 94] for a good overview of the POMDP problem and solution methods). Unfortunately, exact solution remains an elusive goal for all but the most trivial POMDPs [Littman et al, 95], and even approximation methods have achieved only limited success [Parr & Russell, 95].

In general, the solution to a Markov decision problem is a policy: a mapping from each state in the state space to the optimal action given that state. The policy can be represented as a valuation across system states, so solution methods often compute this value function rather than the policy mapping. For a POMDP, either of these is of limited use, since the agent generally does not know the current state precisely. A standard approach is to transform the problem into that of finding a policy which takes as its argument the belief state of the agent, rather than the actual system state. Under the assumption that the agent uses appropriate Bayesian belief updating procedures, it can be shown that an optimal policy for the (presumably fully observable) belief state is also optimal with respect to the underlying POMDP. Unfortunately, even when the underlying state space is discrete, the belief state is continuous. As a result, it would seem that methods developed for solving discrete-state MDPs would not apply to POMDPs (though see [Singh et al, 94] for an exception).

It has been shown that the value function for a POMDP must be piecewise linear convex. As a result, many POMDP solution methods, exact and approximate, have focused on building representations of the value function as the max of a set of planes in belief × value space. This representation has the advantage that it can approximate the exact solution arbitrarily closely (and in some cases can represent the optimal value function exactly). However, it suffers from two disadvantages. First, the number of planes needed tends to grow very rapidly, restricting the approach to very small problems. Second, the size of each plane vector is linear in the size of the state space, which is a severe limitation in many problems. We are investigating an alternate approach, in which we compute a discrete approximation to the belief space and then use standard reinforcement learning methods to compute optimal Q-values with respect to the discretized belief space (a Q-value is a function mapping from belief × action to value).
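To make the belief-state formulation above concrete, here is a minimal sketch, assuming a discrete POMDP with illustrative arrays T (transition) and O (observation) and a set of alpha vectors; the function names, array shapes, and helpers are our own for illustration, not code from the paper. It shows the exact Bayesian belief update and a piecewise-linear-convex value function represented as the max over a set of planes.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Exact Bayesian belief update for a discrete POMDP.

    belief[s]    -- current P(s)
    T[a, s, s']  -- transition probabilities
    O[a, s', o]  -- observation probabilities
    Returns the new belief P(s' | belief, action, observation).
    """
    predicted = belief @ T[action]                    # P(s' | belief, a)
    unnormalized = predicted * O[action][:, observation]
    return unnormalized / unnormalized.sum()          # renormalize

def pwlc_value(belief, alpha_vectors):
    """Piecewise-linear-convex value: max over a set of planes.

    Each plane (alpha vector) has one entry per state, so its size is
    linear in the size of the state space -- the limitation noted above.
    """
    return max(float(alpha @ belief) for alpha in alpha_vectors)
```

Both the number of planes and the length of each plane grow with the problem, which is what motivates the discretized belief space described next.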
The primitive element of our representation is a Κ-abstracted belief state. While the representation for a plane in PCL-based methods grows linearly with state-space size, Κ-abstracted belief-state representations grow only with the log of the state space. We have successfully applied this abstraction method to a small test problem (256 states) which is, to our knowledge, the largest POMDP solved to date. In the remainder of this paper we first describe our algorithm, Κ-RL. Next, we present an experimental domain we have been studying, that of on-line maintenance, and the results of some experiments applying Κ-RL in this domain. We close with a discussion of the results and a review of related work.
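This excerpt does not spell out the abstraction procedure itself; the following is a minimal sketch of one plausible reading, in the spirit of order-of-magnitude (kappa) ranking of probabilities, where each non-negligible state is mapped to an integer rank and the resulting small tuple serves as a discrete key for standard tabular Q-learning. The constants EPSILON and MAX_RANK, the sparse key format, and the Q-learning update are illustrative assumptions, not the paper's Κ-RL algorithm.

```python
import math
from collections import defaultdict

EPSILON = 0.1   # assumed base for order-of-magnitude ranks (not from the paper)
MAX_RANK = 3    # assumed cutoff: states ranked this surprising are dropped

def kappa_abstract(belief):
    """Map a belief vector to a small, hashable key of (state, rank) pairs.

    Rank 0 marks 'likely' states; higher ranks mark increasingly surprising
    ones. Only states below the cutoff are kept, so the key lists a handful
    of state indices (each index costing about log2(|S|) bits) rather than
    one entry per state.
    """
    key = []
    for s, p in enumerate(belief):
        if p <= 0.0:
            continue
        rank = int(math.floor(math.log(p, EPSILON)))
        if rank < MAX_RANK:
            key.append((s, rank))
    return tuple(key)

# Standard tabular Q-learning keyed on the abstracted (discrete) belief space.
Q = defaultdict(float)

def q_update(b_key, action, reward, next_b_key, actions, alpha=0.1, gamma=0.95):
    """One Q-learning backup over abstracted belief states."""
    best_next = max(Q[(next_b_key, a)] for a in actions)
    Q[(b_key, action)] += alpha * (reward + gamma * best_next - Q[(b_key, action)])
```

Keyed this way, a 256-state belief reduces to a tuple naming only its likely states, which is what allows an off-the-shelf discrete reinforcement-learning update to be applied.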


Similar articles

Robot Planning in Partially Observable Continuous Domains

We present a value iteration algorithm for learning to act in Partially Observable Markov Decision Processes (POMDPs) with continuous state spaces. Mainstream POMDP research focuses on the discrete case and this complicates its application to, e.g., robotic problems that are naturally modeled using continuous state spaces. The main difficulty in defining a (belief-based) POMDP in a continuous s...

Dialogue POMDP components (Part II): learning the reward function

The partially observable Markov decision process (POMDP) framework has been applied in dialogue systems as a formal framework to represent uncertainty explicitly while being robust to noise. In this context, estimating the dialogue POMDP model components (states, observations, and reward) is a significant challenge as they have a direct impact on the optimized dialogue POMDP policy. Learning sta...

A Possibilistic Model for Qualitative Sequential Decision Problems under Uncertainty in Partially Observable Environments

In this article we propose a qualitative (ordinal) counterpart for the Partially Observable Markov Decision Processes model (POMDP) in which the uncertainty, as well as the preferences of the agent, are modeled by possibility distributions. This qualitative counterpart of the POMDP model relies on a possibilistic theory of decision under uncertainty, recently developed. One advantage of s...

Real user evaluation of a POMDP spoken dialogue system using automatic belief compression

This article describes an evaluation of a POMDP-based spoken dialogue system (SDS), using crowdsourced calls with real users. The evaluation compares a "Hidden Information State" POMDP system which uses a hand-crafted compression of the belief space, with the same system instead using an automatically computed belief space compression. Automatically computed compressions are a way of introducing a...

POMDP Compression and Decomposition via Belief State Analysis

Partially observable Markov decision process (POMDP) is a commonly adopted mathematical framework for solving planning problems in stochastic environments. However, computing the optimal policy of POMDP for large-scale problems is known to be intractable, where the high dimensionality of the underlying belief state space is one of the major causes. Our research focuses on studying two different...


Publication date: 1997