Search results for: q policy

Number of results: 381,585

2016
Edgard Bonilla Jiaming Zeng Jennie Zheng

We implemented Asynchronous Deep Q-learning to learn the Atari 2600 game Breakout with RAM inputs. We tested the performance of our agent by varying the network structure, training policy, and environment settings. We saw the most notable improvement from changing the environment settings. Furthermore, we observed interesting training effects when we used a Boltzmann-Q Policy that encoura...
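
As background, a Boltzmann-Q (softmax) policy samples actions with probability proportional to exp(Q(s, a)/τ), trading off exploration and exploitation through the temperature τ. A minimal sketch, not the authors' implementation; the Q-values and temperature below are illustrative.

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau).

    High temperatures approach uniform random exploration; low
    temperatures approach greedy action selection.
    """
    # Subtract the max before exponentiating for numerical stability.
    prefs = (q_values - np.max(q_values)) / temperature
    probs = np.exp(prefs)
    probs /= probs.sum()
    return np.random.choice(len(q_values), p=probs)

# Illustrative Q-values for a 4-action state (e.g., a Breakout agent).
q = np.array([0.1, 0.5, 0.2, 0.4])
action = boltzmann_policy(q, temperature=0.5)
```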

2010
Hado van Hasselt

In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way ...
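
The fix proposed in this paper, Double Q-learning, keeps two independent estimators and uses one to select the maximizing action and the other to evaluate it, which removes the positive bias of taking the max over a single noisy estimate. A minimal tabular sketch; the state/action encoding and hyperparameters are placeholders.

```python
import numpy as np

def double_q_update(qa, qb, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Double Q-learning update (van Hasselt, 2010).

    With probability 0.5, update QA using QB to evaluate the action
    that QA ranks highest, and vice versa. Decoupling action selection
    from action evaluation removes the max-operator's positive bias.
    """
    if np.random.rand() < 0.5:
        a_star = np.argmax(qa[s_next])           # select with QA
        target = r + gamma * qb[s_next, a_star]  # evaluate with QB
        qa[s, a] += alpha * (target - qa[s, a])
    else:
        b_star = np.argmax(qb[s_next])           # select with QB
        target = r + gamma * qa[s_next, b_star]  # evaluate with QA
        qb[s, a] += alpha * (target - qb[s, a])
```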

Journal: International Journal of Epidemiology Research
Mehdi Ranjbaran, Epidemiology Dept., Arak University of Medical Sciences, Arak, I.R. Iran; Mahmoud Reza Nakhaei, Nutrition Dept., Arak University of Medical Sciences, Arak, I.R. Iran; Mina Chizary, Midwifery Dept., Arak University of Medical Sciences, Arak, I.R. Iran; Mohsen Shamsi, Health Education Dept., Arak University of Medical Sciences, Arak, I.R. Iran

Background and aims: The World Health Organization and United Nations Children's Fund (UNICEF) recommend exclusive breastfeeding for 6 months after birth. The purpose of this study was to determine the prevalence of exclusive breastfeeding in Iran through a meta-analysis, so that the results can be used by policy-makers in health program planning in this field. Methods: In this meta-analysis study, the da...

2016
Tomáš Petříček Vojtěch Šalanský Karel Zimmermann Tomáš Svoboda

We consider the problem of pan-tilt sensor control for active segmentation of incomplete multi-modal data. Since demanding optimal control does not allow for online replanning, we instead employ the optimal planner offline to provide guiding samples for learning a CNN-based control policy in a guided Q-learning framework. The proposed policy initialization and guided Q-learning avoid poor local...
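
One plausible reading of "guiding samples" is that transitions produced by the offline optimal planner are mixed into the batches used to train the Q-network. A hypothetical sketch of that idea only; the buffer names, mixing fraction, and interfaces below are assumptions, not the authors' code.

```python
import random

def sample_guided_batch(planner_buffer, agent_buffer, batch_size,
                        guide_frac=0.5):
    """Mix planner-generated 'guiding' transitions with the agent's own
    experience so that early Q-learning is steered toward the planner's
    behavior (in practice guide_frac would be annealed toward zero).
    """
    n_guide = int(batch_size * guide_frac)
    batch = random.sample(planner_buffer, n_guide)
    batch += random.sample(agent_buffer, batch_size - n_guide)
    random.shuffle(batch)
    return batch
```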

Journal: CoRR 2011
Mitchell Keith Bloch

Q-learning is a reliable but inefficient off-policy temporal-difference method, backing up reward only one step at a time. Replacing traces, using a recency heuristic, are more efficient but less reliable. In this work, we introduce model-free, off-policy temporal difference methods that make better use of experience than Watkins’ Q(λ). We introduce both Optimistic Q(λ) and the temporal second ...
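
For reference, Watkins' Q(λ), the baseline these methods improve on, backs up the TD error along an eligibility trace but cuts the trace whenever an exploratory action is taken. A minimal tabular sketch with replacing traces; array shapes and hyperparameters are illustrative.

```python
import numpy as np

def watkins_q_lambda_step(Q, E, s, a, r, s_next, next_is_greedy,
                          alpha=0.1, gamma=0.99, lam=0.9):
    """One step of Watkins' Q(lambda) with replacing traces.

    The TD error targets the greedy successor value, and the trace
    spreads it over recently visited state-action pairs. The trace is
    cut to zero after a non-greedy action -- the source of the
    inefficiency noted in the abstract above.
    """
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
    E[s, a] = 1.0            # replacing trace: reset rather than accumulate
    Q += alpha * delta * E   # back up the error along the whole trace
    if next_is_greedy:
        E *= gamma * lam     # decay all traces
    else:
        E[:] = 0.0           # exploratory action: cut the trace
```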

2003
Nancy Fulda Dan Ventura

Many reinforcement learning architectures fail to learn optimal group behaviors in the multiagent domain. Although these coordination difficulties are often attributed to the non-Markovian environment created by the gradually-changing policies of concurrently learning agents, a careful analysis of the situation reveals an underlying problem structure which can cause suboptimal group policies ev...
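
A standard toy illustration of this failure mode is a two-player coordination game in which independent Q-learners, each treating the other as part of the environment, drift toward the safer but lower-payoff joint action. A sketch with an illustrative payoff matrix, not taken from the paper:

```python
import numpy as np

# Penalty-style coordination game: joint action (0, 0) pays 10,
# (1, 1) pays 2, and miscoordination is punished with -10.
PAYOFF = np.array([[10, -10],
                   [-10,  2]])

def independent_q_learners(episodes=5000, alpha=0.1, eps=0.2, seed=0):
    """Two stateless agents learn per-action Q-values independently.

    Miscoordination penalties during exploration drag down the value of
    the optimal action 0, so both agents often converge on the safer,
    suboptimal action 1 -- even though each learns 'correctly'.
    """
    rng = np.random.default_rng(seed)
    q1, q2 = np.zeros(2), np.zeros(2)
    for _ in range(episodes):
        a1 = rng.integers(2) if rng.random() < eps else int(np.argmax(q1))
        a2 = rng.integers(2) if rng.random() < eps else int(np.argmax(q2))
        r = PAYOFF[a1, a2]
        q1[a1] += alpha * (r - q1[a1])
        q2[a2] += alpha * (r - q2[a2])
    return q1, q2
```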

Journal: Operations Research 2008
Melda Ormeci Matoglu J. G. Dai John H. Vande Vate

When a manufacturer places repeated orders with a supplier to meet changing production requirements, he faces the challenge of finding the right balance between holding costs and the operational costs involved in adjusting the shipment sizes. We consider an inventory whose content fluctuates as a Brownian motion in the absence of control. At any moment, a controller can adjust the inventory lev...
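
Problems of this kind are typically solved by a band (impulse-control) policy: let the inventory diffuse freely inside a band and push it back to an interior target when it hits a boundary. A simulation sketch under assumed dynamics; the thresholds below are illustrative, not the paper's optimal values.

```python
import numpy as np

def simulate_band_policy(mu=0.0, sigma=1.0, lower=-2.0, upper=2.0,
                         target_low=-1.0, target_high=1.0,
                         dt=0.01, horizon=10.0, seed=0):
    """Simulate inventory content as a Brownian motion with drift and
    apply a band policy: on hitting `upper`, adjust down to
    `target_high`; on hitting `lower`, adjust up to `target_low`.
    Returns the trajectory and the number of adjustments made.
    """
    rng = np.random.default_rng(seed)
    x, path, adjustments = 0.0, [], 0
    for _ in range(int(horizon / dt)):
        x += mu * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        if x >= upper:
            x, adjustments = target_high, adjustments + 1
        elif x <= lower:
            x, adjustments = target_low, adjustments + 1
        path.append(x)
    return np.array(path), adjustments
```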

2017

State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment....
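
Such a smoothed value can be estimated by Monte Carlo: average the ordinary Q-value over actions perturbed with Gaussian noise. A minimal sketch for a continuous action space; `q_fn`, the noise scale, and the sample count are placeholders, not the paper's training procedure.

```python
import numpy as np

def smoothed_q(q_fn, state, action, sigma=0.1, n_samples=64, seed=0):
    """Monte-Carlo estimate of a Gaussian-smoothed Q-value,
    E[Q(s, a + eps)] with eps ~ N(0, sigma^2 I).
    """
    rng = np.random.default_rng(seed)
    action = np.atleast_1d(np.asarray(action, dtype=float))
    noise = sigma * rng.standard_normal((n_samples, action.size))
    return float(np.mean([q_fn(state, action + e) for e in noise]))
```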


[Chart: number of search results per year]