Off-policy Learning with Linear Action Models: An Efficient "One-Collection-For-All" Solution
Abstract
We propose a model-based off-policy learning method that can evaluate any target policy using data collected from arbitrary sources. The key to this method is a set of linear action models (LAM) learned from the data. The method is simple to use. First, the target policy specifies which actions are taken at given features, and the LAM project what would happen for those actions. Second, a convergent off-policy learning algorithm, such as LSTD or a gradient TD algorithm, evaluates the projected experience. We focus on two off-policy learning algorithms with LAM: the stochastic LAM-LSTD and the deterministic LAM-LSTD. Empirical results show that both LAM-LSTD algorithms give more accurate predictions for various target policies than on-policy LSTD learning. LAM-based off-policy learning algorithms are also especially useful in difficult control tasks where one cannot collect enough "on-policy samples" for on-policy learning. This work leads us to advocate using off-policy learning to evaluate many policies in place of on-policy learning, improving the efficiency of using data.
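The two-step recipe in the abstract (fit one linear model per action, then run LSTD on experience projected through those models) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `learn_lam` and `lam_lstd`, the ridge regularization, and the deterministic one-step projection are all assumptions made for the sketch.

```python
import numpy as np

def learn_lam(transitions, n_actions, d):
    """Fit one linear action model (F_a, f_a) per action by least squares.
    F_a maps a feature vector to the predicted next feature vector;
    f_a gives the predicted immediate reward as an inner product.
    `transitions` is a list of (phi, a, r, phi_next) tuples."""
    models = {}
    for a in range(n_actions):
        data = [(p, r, p2) for (p, aa, r, p2) in transitions if aa == a]
        Phi = np.array([t[0] for t in data])       # (n, d) current features
        R = np.array([t[1] for t in data])         # (n,)   rewards
        Phi2 = np.array([t[2] for t in data])      # (n, d) next features
        G = Phi.T @ Phi + 1e-6 * np.eye(d)         # small ridge term for stability
        F = np.linalg.solve(G, Phi.T @ Phi2).T     # F_a: (d, d)
        f = np.linalg.solve(G, Phi.T @ R)          # f_a: (d,)
        models[a] = (F, f)
    return models

def lam_lstd(features, policy, models, gamma=0.9):
    """Deterministic LAM-LSTD sketch: at each feature vector, take the
    target policy's action, project the next features and reward with
    the LAM, and accumulate the LSTD system A w = b."""
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for phi in features:
        a = policy(phi)
        F, f = models[a]
        phi_next = F @ phi                  # projected next features
        r_hat = f @ phi                     # projected reward
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r_hat
    return np.linalg.solve(A + 1e-6 * np.eye(d), b)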
Similar References
Off-Policy Temporal Difference Learning with Function Approximation
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goa...
Methods of Optimization in Imprecise Data Envelopment Analysis
In this paper, imprecise target models have been proposed to investigate the relation between imprecise data envelopment analysis (IDEA) and mini-max reference point formulations. Through these models, the decision makers' preferences are involved in interactive trade-off analysis procedures in multiple objective linear programming with imprecise data. In addition, the gradient projection type...
Relation Between Imprecise DESA and MOLP Methods
It is generally accepted that Data Envelopment Analysis (DEA) is a method for indicating efficiency. The DEA method has many applications in the field of calculating the relative efficiency of Decision Making Units (DMU) in explicit input-output environments. Regarding imprecise data, several definitions of efficiency can be found. The aim of our work is to show an equivalence relation between ...
Least-Squares Policy Iteration
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference a...
Off-Policy Actor-Critic
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning....