Search results for: q policy
Number of results: 381,585
We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorou...
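A minimal sketch of the core loop described here, under stated assumptions: a linear Q-function on hypothetical `features`, a batch of (s, a, r, s') transitions, and a small finite candidate set `policies`. The greedy step of standard fitted Q-iteration is replaced by selecting the candidate policy that maximizes the average action value over the batch states; none of the names below come from the paper.

```python
# Sketch of a fitted Q-iteration variant where the greedy step is replaced by
# a search over a restricted set of candidate policies (hypothetical setup).
import numpy as np

def features(s, a):
    # hypothetical quadratic state-action features for a linear Q-function
    return np.array([1.0, s, a, s * a, s ** 2, a ** 2])

def fitted_q_iteration(batch, policies, gamma=0.99, n_iters=50):
    """batch: list of (s, a, r, s_next); policies: list of callables s -> a."""
    w = np.zeros(len(features(0.0, 0.0)))            # linear Q(s, a) = w . phi(s, a)
    q = lambda s, a: features(s, a) @ w
    pi = policies[0]
    for _ in range(n_iters):
        # policy-search step: maximize the average action value over batch states
        pi = max(policies, key=lambda p: np.mean([q(s, p(s)) for s, _, _, _ in batch]))
        # regression step: fit Q to one-step Bellman targets under the selected policy
        X = np.array([features(s, a) for s, a, _, _ in batch])
        y = np.array([r + gamma * q(s2, pi(s2)) for _, _, r, s2 in batch])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        q = lambda s, a, w=w: features(s, a) @ w
    return q, pi
```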
Motivated by a study of the logistics systems used to manage consumable service parts for the U.S. military, we consider a static threshold-based rationing policy that is useful when pooling inventory across two demand classes characterized by different arrival rates and shortage (stockout and delay) costs. The scheme operates as a (Q, r) policy with the following feature. Demands from both cla...
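The abstract is truncated before the policy's distinguishing feature, so the sketch below only illustrates one common threshold-rationing variant as an assumption: class-1 (high shortage cost) demand is always filled from stock, class-2 demand only while on-hand inventory exceeds a critical level K, and an order of size Q is placed when the inventory position reaches r. The class labels, the level K, and the simplified inventory-position accounting are all hypothetical.

```python
# A static threshold-rationing rule layered on a (Q, r) policy (illustrative
# assumption, not the exact scheme from the truncated abstract).
from dataclasses import dataclass

@dataclass
class RationingQr:
    Q: int            # order quantity
    r: int            # reorder point on the inventory position
    K: int            # critical (rationing) level protecting class-1 demand
    on_hand: int      # on-hand inventory
    on_order: int = 0

    def demand(self, demand_class: int) -> bool:
        """Return True if a unit demand of the given class is filled from stock."""
        serve = self.on_hand > 0 and (demand_class == 1 or self.on_hand > self.K)
        if serve:
            self.on_hand -= 1
        # inventory position = on hand + on order (backorders ignored in this sketch)
        if self.on_hand + self.on_order <= self.r:
            self.on_order += self.Q
        return serve
```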
Debt management policy for governments of developing countries must balance conflicting objectives. The structure of explicit and implicit government debt influences the amount of lending private creditors are willing to extend, contractual debt service costs, the probability of default and the costs of default. Because default is not relevant for governments of industrial countries, their debt...
(ii) A is an action space, which is also supposed to be a Polish space, and A(x) is a Borel set which denotes the set of available actions at state x ∈ S. The set K := {(x, a) : x ∈ S, a ∈ A(x)} is assumed to be a Borel subset of S × A. (iii) q(· | x, a) denotes the transition rates, and they are supposed to satisfy the following properties: for each (x, a) ∈ K and D ∈ B(S), (Q1) D → q...
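For context, these are the conditions typically imposed on transition rates q(· | x, a) in continuous-time MDPs; the truncated property (Q1) above may be stated differently in the paper itself.

```latex
% Standard conditions on transition rates in continuous-time MDPs
% (stated here for reference only).
\begin{align*}
  &\text{(signed measure)} && q(\cdot \mid x,a) \text{ is a signed measure on } \mathcal{B}(S)
      \text{ for each } (x,a) \in K,\\
  &\text{(nonnegativity off the diagonal)} && q(D \mid x,a) \ge 0 \text{ whenever } x \notin D,\\
  &\text{(conservativeness)} && q(S \mid x,a) = 0,\\
  &\text{(stability)} && q^{*}(x) := \sup_{a \in A(x)} \bigl[-q(\{x\} \mid x,a)\bigr] < \infty .
\end{align*}
```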
Many network applications (such as swarming downloads, peer-to-peer video streaming and file sharing) are made possible by using large groups of peers to distribute and process data. Securing data in such a system requires not just data originators, but also those “distributors,” to enforce access control, verify integrity, or make other content-specific security decisions for the replicated or...
We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalize Linearly Solvable MDPs (LMDPs). For finite-horizon problems, these lead to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI algorithm, a state-of-the-art policy optimization algorith...
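The NMDP-specific update equations are not reproduced in the abstract; the sketch below only illustrates the general shape of a rollout-based, finite-horizon policy-gradient update (REINFORCE with reward-to-go), with a hypothetical `env` interface and a linear-Gaussian policy as stated assumptions.

```python
# Generic rollout-based policy gradient for a finite-horizon problem
# (illustrative; not the paper's NMDP-specific equations).
import numpy as np

def rollout(env, theta, horizon, sigma=0.1):
    """One trajectory under a linear-Gaussian policy a = theta . s + noise."""
    s = env.reset()
    grads, rewards = [], []
    for _ in range(horizon):
        mean = theta @ s
        a = mean + sigma * np.random.randn()
        grads.append((a - mean) / sigma ** 2 * s)   # grad of log N(a; theta.s, sigma^2)
        s, r, done = env.step(a)
        rewards.append(r)
        if done:
            break
    return grads, rewards

def policy_gradient_step(env, theta, horizon, n_rollouts=20, lr=1e-2):
    g = np.zeros_like(theta)
    for _ in range(n_rollouts):
        grads, rewards = rollout(env, theta, horizon)
        returns = np.cumsum(rewards[::-1])[::-1]    # undiscounted reward-to-go
        g += sum(gr * R for gr, R in zip(grads, returns)) / n_rollouts
    return theta + lr * g
```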
This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural p...
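A hedged summary of the standard identity this architecture relies on: with compatible features ψ(s, a) = ∇_θ log π_θ(a | s) and a linear critic fitted by regression, the natural policy gradient collapses to the critic weights w, so the actor update is a plain step along w.

```latex
% Compatible function approximation makes the natural gradient equal to the
% critic weights (standard result; not a verbatim statement from the paper).
\begin{align*}
  \widetilde{\nabla}_\theta J(\theta)
    &= G(\theta)^{-1} \nabla_\theta J(\theta), \qquad
  G(\theta) = \mathbb{E}\!\left[\psi(s,a)\,\psi(s,a)^{\!\top}\right],\\
  \nabla_\theta J(\theta)
    &= \mathbb{E}\!\left[\psi(s,a)\,A_w(s,a)\right]
     = G(\theta)\, w
  \;\Longrightarrow\;
  \widetilde{\nabla}_\theta J(\theta) = w, \qquad
  \theta_{t+1} = \theta_t + \alpha\, w .
\end{align*}
```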
This paper presents a novel hybrid learning method and performance evaluation methodology for adaptive autonomous agents. Measuring the performance of a learning agent is not a trivial task and generally requires long simulations as well as knowledge about the domain. A generic evaluation methodology has been developed to precisely evaluate the performance of policy estimation techniques. This ...
In this paper we deal with the integrated supply chain management problem in the context of a single-vendor, single-buyer system for which the production unit is assumed to randomly shift from an in-control to an out-of-control state. At the end of each production cycle, a corrective or preventive maintenance action is performed, depending on the state of the production unit, and a new setup is c...
While off-policy temporal difference methods have been broadly used in reinforcement learning due to their efficiency and simple implementation, their Bayesian counterparts have been relatively understudied. This is mainly because the max operator in the Bellman optimality equation brings non-linearity and inconsistent distributions over the value function. In this paper, we introduce a new Bayesia...
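For reference, this is the Bellman optimality equation in question; the max over next actions is the non-linearity the abstract refers to, since the maximum of Gaussian-distributed action values is itself non-Gaussian, which complicates a Bayesian treatment of off-policy TD targets.

```latex
% Bellman optimality equation; the max operator is the source of the
% non-linearity mentioned above.
\begin{equation*}
  Q^{*}(s,a) \;=\; \mathbb{E}\!\left[\, r(s,a) + \gamma \max_{a'} Q^{*}(s',a') \;\middle|\; s,a \right].
\end{equation*}
```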