Classes of Multiagent Q-learning Dynamics with epsilon-greedy Exploration
Authors
Abstract
Q-learning in single-agent environments is known to converge in the limit given sufficient exploration. The same algorithm has been applied, with some success, in multiagent environments, where traditional analysis techniques break down. Using established dynamical systems methods, we derive and study an idealization of Q-learning in 2-player 2-action repeated general-sum games. In particular, we address the discontinuous case of ε-greedy exploration and use it as a proxy for value-based algorithms to highlight a contrast with existing results in policy search. Analogously to previous results for gradient ascent algorithms, we provide a complete catalog of the convergence behavior of the ε-greedy Q-learning algorithm by introducing new subclasses of these games. We identify two subclasses of Prisoner's Dilemma-like games where the application of Q-learning with ε-greedy exploration results in higher-than-Nash average payoffs for some initial conditions.
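The setting the abstract describes can be made concrete with a small sketch: two stateless Q-learners playing a repeated Prisoner's Dilemma, each choosing actions ε-greedily over its own Q-values. This is only an illustrative simulation under assumed parameters (payoff matrix, α, γ, ε), not the idealized dynamics derived in the paper.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon explore uniformly; otherwise act greedily.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def self_play(payoffs, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    # Repeated 2-player 2-action game; each agent keeps one Q-value per action.
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[player][action]
    for _ in range(episodes):
        actions = [epsilon_greedy(q[i], epsilon, rng) for i in (0, 1)]
        for i in (0, 1):
            reward = payoffs[i][actions[0]][actions[1]]
            # Standard Q-learning update, bootstrapping on the agent's own max.
            q[i][actions[i]] += alpha * (reward + gamma * max(q[i]) - q[i][actions[i]])
    return q

# Prisoner's Dilemma payoffs (assumed 3/0/5/1 values); action 0 = cooperate, 1 = defect.
PD_ROW = [[3, 0], [5, 1]]                       # row player's payoff matrix
PD_COL = [[3, 5], [0, 1]]                       # column player's payoff matrix
q = self_play((PD_ROW, PD_COL))
```

With discount γ, Q-values here are bounded by the largest payoff divided by (1 − γ); varying the seed or ε illustrates the sensitivity to initial conditions that the abstract's catalog of behaviors formalizes.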
Similar resources
Modelling the dynamics of multiagent Q-learning with ε-greedy exploration
We present a framework to model the dynamics of multiagent Q-learning with ε-greedy exploration. The applicability of the framework is tested through experiments in typical games selected from the literature.
Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
Achieving efficient and scalable exploration in complex domains poses a major challenge in reinforcement learning. While Bayesian and PAC-MDP approaches to the exploration problem offer strong formal guarantees, they are often impractical in higher dimensions due to their reliance on enumerating the state-action space. Hence, exploration in complex domains is often performed with simple epsilon...
Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax
This paper proposes “Value-Difference Based Exploration combined with Softmax action selection” (VDBE-Softmax) as an adaptive exploration/exploitation policy for temporal-difference learning. The advantage of the proposed approach is that exploration actions are only selected in situations when the knowledge about the environment is uncertain, which is indicated by fluctuating values during lea...
The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems
Reinforcement learning can provide a robust and natural means for agents to learn how to coordinate their action choices in multiagent systems. We examine some of the factors that can influence the dynamics of the learning process in such a setting. We first distinguish reinforcement learners that are unaware of (or ignore) the presence of other agents from those that explicitly attempt to lear...
Ipseity: An Open-Source Platform for Synthesizing and Validating Artificial Cognitive Systems in MAS (Demonstration)
This article presents an overview of Ipseity, an open-source platform developed in C++ with the Qt framework. The current version of the platform includes a set of plugins implementing single-agent and multi-agent environments, hardcoded controllers based on Artificial Intelligence (AI) techniques, classical Reinforcement Learning (RL) techniques like Q-Learning, Sarsa, Epsilon-Greedy combined ...