Fitted Q-iteration by Advantage Weighted Regression
نویسندگان
چکیده
Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantageweighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high dimensional action spaces.
منابع مشابه
An Empirical Comparison of Neural Architectures for Reinforcement Learning in Partially Observable Environments
This paper explores the performance of fitted neural Q iteration for reinforcement learning in several partially observable environments, using three recurrent neural network architectures: Long ShortTerm Memory [7], Gated Recurrent Unit [3] and MUT1, a recurrent neural architecture evolved from a pool of several thousands candidate architectures [8]. A variant of fitted Q iteration, based on A...
متن کاملCFQI: Fitted Q-Iteration with Complex Returns
Fitted Q-Iteration (FQI) is a popular approximate value iteration (AVI) approach that makes effective use of off-policy data. FQI uses a 1-step return value update which does not exploit the sequential nature of trajectory data. Complex returns (weighted averages of the n-step returns) use trajectory data more effectively, but have not been used in an AVI context because of off-policy bias. In ...
متن کاملRegularized Fitted Q-iteration: Application to Planning
We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose to use fitted Q-iteration with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of controlling model-complexity. The algorithm is presented in detail for the case when the functio...
متن کاملBoosted Fitted Q-Iteration
This paper is about the study of B-FQI, an Approximated Value Iteration (AVI) algorithm that exploits a boosting procedure to estimate the action-value function in reinforcement learning problems. B-FQI is an iterative off-line algorithm that, given a dataset of transitions, builds an approximation of the optimal action-value function by summing the approximations of the Bellman residuals acros...
متن کاملDeep Reinforcement Learning with Regularized Convolutional Neural Fitted Q Iteration
We review the deep reinforcement learning setting, in which an agent receiving high-dimensional input from an environment learns a control policy without supervision using multilayer neural networks. We then extend the Neural Fitted Q Iteration value-based reinforcement learning algorithm (Riedmiller et al) by introducing a novel variation which we call Regularized Convolutional Neural Fitted Q...
متن کامل