Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax
Authors
Abstract
This paper proposes “Value-Difference Based Exploration combined with Softmax action selection” (VDBE-Softmax) as an adaptive exploration/exploitation policy for temporal-difference learning. The advantage of the proposed approach is that exploration actions are selected only in situations in which the agent's knowledge about the environment is uncertain, as indicated by fluctuating value estimates during learning. The method is evaluated in experiments with deterministic rewards and in experiments with a mixture of deterministic and stochastic rewards. The results show that a VDBE-Softmax policy can outperform ε-greedy, Softmax, and VDBE policies in combination with on- and off-policy learning algorithms such as Q-learning and Sarsa. Furthermore, it is also shown that VDBE-Softmax is more reliable in the case of value-function oscillations.
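To make the selection rule concrete, it can be pictured roughly as in the following minimal Python sketch, assuming a tabular agent that maintains a per-state exploration rate epsilon_s via VDBE; the temperature tau, the default values, and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def vdbe_softmax_action(q_values, epsilon_s, tau=1.0, rng=None):
    """Action selection in the spirit of VDBE-Softmax (sketch).

    With probability epsilon_s (a state-dependent exploration rate,
    as maintained by VDBE) an action is drawn from a Softmax/Boltzmann
    distribution over the action values; otherwise the greedy action
    is taken.
    """
    rng = rng or np.random.default_rng()
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon_s:
        # Softmax exploration; subtract the maximum for numerical stability
        prefs = (q_values - q_values.max()) / tau
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(q_values), p=probs))
    # Greedy exploitation
    return int(np.argmax(q_values))
```

Because epsilon_s shrinks in states whose value estimates have stopped fluctuating, exploration concentrates on the parts of the state space the agent is still uncertain about, which is the behaviour the abstract describes.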
Similar Resources
Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences
This paper presents “Value-Difference Based Exploration” (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the exploration parameter of ε-greedy in dependence on the temporal-difference error observed from value-function backups, which is considered a measure of the agent's uncertainty about the environment. VDB...
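The ε-adaptation this abstract refers to can be sketched as follows, again as a non-authoritative illustration: after each value-function backup the per-state ε is moved towards a Boltzmann-style function of the absolute TD error. The constants sigma and delta used here are placeholder defaults of my own, not values from the paper.

```python
import math

def vdbe_epsilon_update(epsilon_s, td_error, alpha, sigma=1.0, delta=0.1):
    """Sketch of a value-difference based update of epsilon(s).

    epsilon_s: current exploration rate of the visited state
    td_error : temporal-difference error of the last value backup
    alpha    : learning rate of the value-function update
    sigma    : sensitivity constant (smaller -> stronger reaction)
    delta    : step size with which epsilon(s) tracks the new estimate
    """
    x = math.exp(-abs(alpha * td_error) / sigma)
    f = (1.0 - x) / (1.0 + x)  # near 1 for large TD errors, near 0 for small ones
    return delta * f + (1.0 - delta) * epsilon_s
```

Large, fluctuating TD errors thus push ε(s) up and keep the state exploratory, while converged values let ε(s) decay towards exploitation.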
Task Allocation through Vacancy Chains: Action Selection in Multi-Robot Learning
We present an adaptive multi-robot task allocation algorithm based on vacancy chains, a resource distribution process common in animal and human societies. The algorithm uses individual reinforcement learning of task utilities and relies on the specializing abilities of the members of the group to promote dedicated optimal allocation patterns. We demonstrate through experiments in simulation, t...
Dynamic Locomotion Skills for Obstacle Sequences Using Reinforcement Learning
Most locomotion control strategies are developed for flat terrain. We explore the use of reinforcement learning to develop motor skills for the highly dynamic traversal of terrains having sequences of gaps, walls, and steps. Results are demonstrated using simulations of a 21-link planar dog and a 7-link planar biped. Our approach is characterized by: non-parametric representation of the value f...
Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems
Multi-armed bandit tasks have been extensively used to model the problem of balancing exploitation and exploration. A most challenging variant of the MABP is the non-stationary bandit problem where the agent is faced with the increased complexity of detecting changes in its environment. In this paper we examine a non-stationary, discrete-time, finite horizon bandit problem with a finite number ...
Task Allocation through Vacancy Chains: Effects of Action Selection in Multi-Robot Learning
We present an adaptive multi-robot task allocation algorithm based on vacancy chains, a resource distribution process common in animal and human societies. The algorithm uses individual reinforcement learning of task utilities and relies on the specializing abilities of the members of the group to promote dedicated optimal allocation patterns. We demonstrate through experiments in simulation, t...