Off-policy Learning with Linear Action Models: An Efficient "One-Collection-For-All" Solution

Author

  • Hengshuai Yao
Abstract

We propose a model-based off-policy learning method that can evaluate any target policy using data collected from arbitrary sources. The key to this method is a set of linear action models (LAM) learned from the data. The method is simple to use. First, the target policy specifies which actions are taken at the observed features, and the LAM project what would happen for those actions. Second, a convergent off-policy learning algorithm, such as LSTD or a gradient TD algorithm, evaluates the projected experience. We focus on two off-policy learning algorithms with LAM: stochastic LAM-LSTD and deterministic LAM-LSTD. Empirical results show that the two LAM-LSTD algorithms give more accurate predictions for various target policies than on-policy LSTD. LAM-based off-policy learning algorithms are also especially useful in difficult control tasks where one cannot collect sufficient "on-policy samples" for on-policy learning. This work leads us to advocate using off-policy learning, in place of on-policy learning, to evaluate many policies, improving the efficiency of data use.
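The two steps described in the abstract can be illustrated with a minimal Python/NumPy sketch: one linear model (F_a, f_a) is fit per action to predict the expected next feature vector and reward, and a deterministic LSTD solve is then run on experience projected through the target policy's actions. The function names (learn_lam, lam_lstd), the discount factor, and the regularization below are illustrative assumptions, not details taken from the paper.

    # Sketch of off-policy evaluation with linear action models (LAM)
    # followed by an LSTD solve on the projected experience.
    import numpy as np

    def learn_lam(transitions, n_actions, dim, reg=1e-6):
        """Fit one linear model (F_a, f_a) per action from arbitrary data.

        transitions: list of (phi, a, r, phi_next), with phi and phi_next
        as dim-vectors. F_a predicts the expected next feature; f_a the reward.
        """
        F = [np.zeros((dim, dim)) for _ in range(n_actions)]
        f = [np.zeros(dim) for _ in range(n_actions)]
        for a in range(n_actions):
            data = [(p, r, pn) for (p, act, r, pn) in transitions if act == a]
            if not data:
                continue
            Phi = np.array([p for p, _, _ in data])        # (n, dim)
            PhiN = np.array([pn for _, _, pn in data])     # (n, dim)
            R = np.array([r for _, r, _ in data])          # (n,)
            G = Phi.T @ Phi + reg * np.eye(dim)
            F[a] = np.linalg.solve(G, Phi.T @ PhiN).T      # F_a phi ~ E[phi' | phi, a]
            f[a] = np.linalg.solve(G, Phi.T @ R)           # f_a . phi ~ E[r | phi, a]
        return F, f

    def lam_lstd(features, policy, F, f, gamma=0.9, reg=1e-6):
        """Deterministic LAM-LSTD sketch: project each visited feature through
        the target policy's action model, then solve the LSTD fixed point."""
        dim = features[0].shape[0]
        A = reg * np.eye(dim)
        b = np.zeros(dim)
        for phi in features:
            a = policy(phi)            # action the target policy would take
            phi_next = F[a] @ phi      # model-projected next feature
            r = f[a] @ phi             # model-projected reward
            A += np.outer(phi, phi - gamma * phi_next)
            b += phi * r
        return np.linalg.solve(A, b)   # value-function weights

In this sketch, any target policy expressed as a mapping from features to actions can be evaluated against the same fitted models, which is what makes a single data collection reusable for all policies.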


Similar articles

Off-Policy Temporal Difference Learning with Function Approximation

We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goa...
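For context, the importance-weighting idea underlying this family of off-policy TD methods can be sketched as a ratio-weighted linear TD(0) update. Note that this bare form is not itself guaranteed to converge; the step size, discount factor, and policy-probability functions below are illustrative assumptions, not the cited algorithm.

    # Sketch of importance-sampling-corrected linear TD(0).
    import numpy as np

    def off_policy_td0(transitions, target_prob, behavior_prob,
                       dim, alpha=0.01, gamma=0.9):
        """transitions: iterable of (phi, a, r, phi_next).
        target_prob(a, phi) and behavior_prob(a, phi) give pi(a|s) and mu(a|s)."""
        theta = np.zeros(dim)
        for phi, a, r, phi_next in transitions:
            rho = target_prob(a, phi) / behavior_prob(a, phi)  # importance ratio
            delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
            theta += alpha * rho * delta * phi                  # corrected update
        return theta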


Methods of Optimization in Imprecise Data Envelopment Analysis

In this paper, imprecise target models have been proposed to investigate the relation between imprecise data envelopment analysis (IDEA) and mini-max reference point formulations. Through these models, the decision makers' preferences are involved in interactive trade-off analysis procedures in multiple objective linear programming with imprecise data. In addition, the gradient projection type...


Relation Between Imprecise DESA and MOLP Methods

It is generally accepted that Data Envelopment Analysis (DEA) is a method for indicating efficiency. The DEA method has many applications in the field of calculating the relative efficiency of Decision Making Units (DMU) in explicit input-output environments. Regarding imprecise data, several definitions of efficiency can be found. The aim of our work is to show an equivalence relation between ...


Least-Squares Policy Iteration

We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference a...
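The policy-evaluation step at the core of that approach can be sketched as an LSTDQ-style least-squares solve over state-action features. The feature function phi_sa, the dataset format, the discount factor, and the regularizer below are illustrative assumptions, not the authors' implementation.

    # Sketch of an LSTDQ-style evaluation step over state-action features.
    import numpy as np

    def lstdq(samples, phi_sa, policy, dim, gamma=0.9, reg=1e-6):
        """samples: list of (s, a, r, s_next); phi_sa(s, a) returns a
        dim-vector; policy(s) returns the action of the policy being evaluated."""
        A = reg * np.eye(dim)
        b = np.zeros(dim)
        for s, a, r, s_next in samples:
            phi = phi_sa(s, a)
            phi_next = phi_sa(s_next, policy(s_next))  # next action from evaluated policy
            A += np.outer(phi, phi - gamma * phi_next)
            b += phi * r
        return np.linalg.solve(A, b)  # weights of the approximate Q-function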


Off-Policy Actor-Critic

This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning....



Publication year: 2011