Search results for: q policy
Number of results: 381,585
We implemented Asynchronous Deep Q-learning to learn the Atari 2600 game Breakout with RAM inputs. We tested the performance of our agent by varying network structure, training policy, and environment settings. We saw the most notable improvement through changing the environment settings. Furthermore, we observed interesting training effects when we used a Boltzmann-Q Policy that encoura...
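For context, a Boltzmann-Q policy draws actions from a softmax over the current Q-values, so exploration is biased toward actions the agent already rates highly; a minimal sketch, assuming a NumPy implementation and a temperature parameter tau (both illustrative, not the authors' code):

import numpy as np

def boltzmann_policy(q_values, tau=1.0):
    # Sample an action with probability proportional to exp(Q(s, a) / tau):
    # high tau approaches uniform exploration, tau -> 0 approaches greedy.
    # (Sketch only; the temperature value and its schedule are assumptions.)
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()          # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)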
In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way ...
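The alternative this abstract introduces is Double Q-learning: maintain two independent estimators and let one select the maximizing action while the other evaluates it, which decorrelates selection and evaluation and removes the positive bias of the max operator. A minimal tabular sketch (the state/action encoding and hyperparameters are illustrative assumptions):

import random
from collections import defaultdict

alpha, gamma = 0.1, 0.99        # illustrative hyperparameters
Q_A = defaultdict(float)        # two independent tabular estimators
Q_B = defaultdict(float)

def double_q_update(s, a, r, s_next, actions):
    # One Double Q-learning backup: with probability 1/2, update Q_A using
    # Q_B to evaluate the action that Q_A selects, and vice versa.
    if random.random() < 0.5:
        a_star = max(actions, key=lambda b: Q_A[(s_next, b)])
        target = r + gamma * Q_B[(s_next, a_star)]
        Q_A[(s, a)] += alpha * (target - Q_A[(s, a)])
    else:
        b_star = max(actions, key=lambda b: Q_B[(s_next, b)])
        target = r + gamma * Q_A[(s_next, b_star)]
        Q_B[(s, a)] += alpha * (target - Q_B[(s, a)])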
Background and aims: The World Health Organization and the United Nations Children's Fund (UNICEF) recommend exclusive breastfeeding for 6 months after birth. The purpose of this study was to determine the prevalence of exclusive breastfeeding in Iran through a meta-analysis, to be used by policy-makers for health program planning in this field. Methods: In this meta-analysis study, the da...
We consider the problem of pan-tilt sensor control for active segmentation of incomplete multi-modal data. Since the computational demands of optimal control do not allow for online replanning, we instead employ the optimal planner offline to provide guiding samples for learning a CNN-based control policy in a guided Q-learning framework. The proposed policy initialization and guided Q-learning avoid poor local...
Q-learning is a reliable but inefficient off-policy temporal-difference method, backing up reward only one step at a time. Replacing traces, using a recency heuristic, are more efficient but less reliable. In this work, we introduce model-free, off-policy temporal difference methods that make better use of experience than Watkins’ Q(λ). We introduce both Optimistic Q(λ) and the temporal second ...
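For context, Watkins' Q(λ) is the baseline these methods improve on: it assigns the one-step TD error to recently visited state-action pairs through eligibility traces, cutting the traces whenever an exploratory action is taken. A minimal tabular sketch with replacing traces (the hyperparameters and dictionary-based tables are illustrative assumptions, not the paper's code):

from collections import defaultdict

alpha, gamma, lam = 0.1, 0.99, 0.9   # illustrative hyperparameters
Q = defaultdict(float)               # tabular action values
E = defaultdict(float)               # eligibility traces

def watkins_q_lambda_step(s, a, r, s_next, a_next, actions):
    # One backup of Watkins' Q(lambda) with replacing traces (sketch).
    a_star = max(actions, key=lambda b: Q[(s_next, b)])       # greedy action
    delta = r + gamma * Q[(s_next, a_star)] - Q[(s, a)]       # one-step TD error
    E[(s, a)] = 1.0                  # replacing trace: reset to 1, don't accumulate
    greedy_next = Q[(s_next, a_next)] == Q[(s_next, a_star)]  # ties count as greedy
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        # decay traces if the next action is greedy; cut them otherwise
        E[key] = gamma * lam * E[key] if greedy_next else 0.0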
Many reinforcement learning architectures fail to learn optimal group behaviors in the multiagent domain. Although these coordination difficulties are often attributed to the non-Markovian environment created by the gradually changing policies of concurrently learning agents, a careful analysis of the situation reveals an underlying problem structure which can cause suboptimal group policies ev...
When a manufacturer places repeated orders with a supplier to meet changing production requirements, he faces the challenge of finding the right balance between holding costs and the operational costs involved in adjusting the shipment sizes. We consider an inventory whose content fluctuates as a Brownian motion in the absence of control. At any moment, a controller can adjust the inventory lev...
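For context, this class of problems is typically posed as impulse control of a Brownian inventory; the specific cost functional below is an illustrative assumption, not taken from the paper, balancing discounted holding costs against fixed-plus-proportional adjustment costs:

\min_{\{(\tau_i,\,\xi_i)\}} \; \mathbb{E}\!\left[\int_0^\infty e^{-\alpha t}\, h(X_t)\, dt \;+\; \sum_i e^{-\alpha \tau_i}\,(K + k\,|\xi_i|)\right], \qquad X_t = x + B_t + \sum_{\tau_i \le t} \xi_i,

where B is a standard Brownian motion, h a holding/shortage cost rate, K a fixed and k a proportional cost per adjustment, \alpha a discount rate, and (\tau_i, \xi_i) the times and sizes of the controller's adjustments.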
State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment....
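In symbols, the proposed smoothed value averages the ordinary Q-function over Gaussian-perturbed actions; the notation below (covariance \Sigma, policy \pi) is an assumption based on the abstract rather than the paper's exact formulation:

\widetilde{Q}^{\pi}(s, a) = \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a, \Sigma)}\left[Q^{\pi}(s, \tilde{a})\right],

which, substituting the ordinary Bellman equation for Q^{\pi}, yields a Bellman-style recursion in the smoothed values themselves:

\widetilde{Q}^{\pi}(s, a) = \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a, \Sigma)}\, \mathbb{E}_{s' \sim P(\cdot \mid s, \tilde{a})}\left[r(s, \tilde{a}) + \gamma\, \widetilde{Q}^{\pi}(s', \pi(s'))\right].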