Search results for: q policy

Number of results: 381585

2007
András Antos Rémi Munos Csaba Szepesvári

We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorou...

Journal: Management Science 2003
Vinayak Deshpande Morris A. Cohen Karen Donohue

Motivated by a study of the logistics systems used to manage consumable service parts for the U.S. military, we consider a static threshold-based rationing policy that is useful when pooling inventory across two demand classes characterized by different arrival rates and shortage (stockout and delay) costs. The scheme operates as a (Q, r) policy with the following feature. Demands from both cla...

2000
Michael P. Dooley

Debt management policy for governments of developing countries must balance conflicting objectives. The structure of explicit and implicit government debt influences the amount of lending private creditors are willing to extend, contractual debt service costs, the probability of default and the costs of default. Because default is not relevant for governments of industrial countries, their debt...

2010
Quanxin Zhu Xinsong Yang Chuangxia Huang Nikolaos Papageorgiou

(ii) A is an action space, which is also supposed to be a Polish space, and A(x) is a Borel set which denotes the set of available actions at state x ∈ S. The set K := {(x, a) : x ∈ S, a ∈ A(x)} is assumed to be a Borel subset of S × A. (iii) q(· | x, a) denotes the transition rates, and they are supposed to satisfy the following properties: for each (x, a) ∈ K and D ∈ B(S), (Q1) D → q...

2005
Eric Anderson

Many network applications (such as swarming downloads, peer-to-peer video streaming and file sharing) are made possible by using large groups of peers to distribute and process data. Securing data in such a system requires not just data originators, but also those “distributors,” to enforce access control, verify integrity, or make other content-specific security decisions for the replicated or...

2012
Krishnamurthy Dvijotham Emanuel Todorov

We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalize Linearly Solvable MDPs (LMDPs). For finite horizon problems, these lead to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI algorithm, a state-of-the-art policy optimization algorith...

2005
Jan Peters Sethu Vijayakumar Stefan Schaal

This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural p...

2006
Richardson Ribeiro Fabrício Enembreck Alessandro L. Koerich

This paper presents a novel hybrid learning method and performance evaluation methodology for adaptive autonomous agents. Measuring the performance of a learning agent is not a trivial task and generally requires long simulations as well as knowledge about the domain. A generic evaluation methodology has been developed to precisely evaluate the performance of policy estimation techniques. This ...

Journal: J. Intelligent Manufacturing 2012
Yesser Yedes Anis Chelbi Nidhal Rezg

In this paper we deal with the integrated supply chain management problem in the context of a single-vendor single-buyer system for which the production unit is assumed to randomly shift from an in-control to an out-of-control state. At the end of each production cycle, a corrective or preventive maintenance action is performed, depending on the state of the production unit, and a new setup is c...

Journal: CoRR 2017
Heejin Jeong Daniel D. Lee

While off-policy temporal difference methods have been broadly used in reinforcement learning due to their efficiency and simple implementation, their Bayesian counterparts have been relatively understudied. This is mainly because the max operator in the Bellman optimality equation brings non-linearity and inconsistent distributions over value function. In this paper, we introduce a new Bayesia...
