Rational and Convergent Model-Free Adaptive Learning for Team Markov Games
Abstract
In this paper, we address multi-agent decision problems where all agents share a common goal. This class of problems is suitably modeled using finite-state Markov games with identical interests. We tackle the problem of coordination and contribute a new algorithm, coordinated Q-learning (CQL). CQL combines Q-learning with biased adaptive play, a coordination mechanism based on the principle of fictitious play. We analyze how the two methods can be combined without compromising the convergence of either. We illustrate the performance of CQL in several different environments and discuss several properties of this algorithm.
Recent years have witnessed increasing interest in extending reinforcement learning (RL) to multi-agent problems. However, reinforcement learning methods often require the environment to be stationary. A learning agent that shares its environment with other agents can treat them as part of the environment, but doing so implicitly violates the stationarity assumption, which can lead to poor convergence of the learning algorithms. Even if convergence is attained, the learned policy can be unsatisfactory. Markov games (also known as stochastic games), understood as extensions of Markov processes to multi-agent scenarios, have been widely used as models for multi-agent reinforcement learning problems, and several researchers have adapted classical RL methods to this multi-agent framework. Littman [16] proposed the Minimax-Q algorithm as a possible application of Q-learning to zero-sum Markov games. Hu and Wellman [12] later proposed Nash-Q, an elaboration on Minimax-Q that can be applied to general-sum Markov games. They established convergence of Nash-Q under quite stringent conditions, which motivated the development of Friend-or-Foe Q-learning (FFQ) [18]. FFQ requires less stringent assumptions than Nash-Q, while retaining its convergence properties in several classes of Markov games.
Claus and Boutilier [7] proposed joint-action learners (JAL), combining Q-learning with fictitious play in team Markov games. Uther and Veloso [25] combined fictitious play with prioritized sweeping to address planning in adversarial scenarios. Gradient-based learning strategies are analyzed in detail in [4, 22]; Bowling and Veloso [5] propose a policy-based learning method that applies a policy hill-climbing strategy with a varying step size, using the principle of “win or learn fast” (WoLF-PHC). Many other works on multi-agent learning systems can be found in the literature (see, for example, the surveys [3, 20]). In this paper, we address finite-state Markov games with identical interests (henceforth referred to as team Markov games). We cast this class of games as a generalization of Markov decision
Similar resources
Mini/Micro-Grid Adaptive Voltage and Frequency Stability Enhancement Using Q-learning Mechanism
This paper develops an adaptive control method for controlling the frequency and voltage of an islanded mini/micro grid (M/µG) using a reinforcement learning method. Reinforcement learning (RL) is a branch of machine learning and the main solution method for Markov decision processes (MDPs). Among the several solution methods of RL, the Q-learning method is used for solving RL in th...
Convergent Reinforcement Learning for Hierarchical Reactive Plans
Hierarchical reinforcement learning techniques operate on structured plans. Although structured representations add expressive power to Markov Decision Processes (MDPs), current approaches impose constraints that force the associated convergence proofs to depend upon a subroutine-style execution model that restricts adaptive response. We develop an alternate approach to convergent learning that ...
Model-Building Adaptive Critics for Semi-Markov Control
Adaptive (or actor) critics are a class of reinforcement learning algorithms. Generally, in adaptive critics, one starts with randomized policies and gradually updates the probability of selecting actions until a deterministic policy is obtained. Classically, these algorithms have been studied for Markov decision processes under model-free updates. Algorithms that build the model are often more...
An Adaptive Approach to Increase Accuracy of Forward Algorithm for Solving Evaluation Problems on Unstable Statistical Data Set
Nowadays, hidden Markov models are extensively utilized for modeling stochastic processes. These models help researchers establish and implement the desired theoretical foundations using Markov algorithms such as the Forward algorithm. However, using the stability hypothesis and the mean statistic for determining the values of Markov functions on unstable statistical data sets has led to a significant reducti...
Properties of Equilibrium Asset Prices Under Alternative Learning Schemes
This paper characterizes equilibrium asset prices under adaptive, rational and Bayesian learning schemes in a model where dividends evolve on a binomial lattice. The properties of equilibrium stock and bond prices under learning are shown to differ significantly. Learning causes the discount factor and risk-neutral probability measure to become path-dependent and introduces serial correlation a...