Opposition-Based Q(λ) with Non-Markovian Update
Authors
Abstract
The OQ(λ) algorithm benefits from an extension of eligibility traces introduced as the opposition trace. This technique combines the idea of opposition with eligibility traces to deal with large state-space problems in reinforcement learning applications. In our previous work, comparing the results of OQ(λ) with conventional Watkins' Q(λ) showed a remarkable increase in performance for the OQ(λ) algorithm. However, the Markovian update of the opposition traces is an issue, and it is investigated in this paper. It has been assumed that the opposite state can be presented to the agent, which may limit the usability of the technique to deterministic environments. In order to relax this assumption, the Non-Markovian Opposition-Based Q(λ) (NOQ(λ)) algorithm is introduced in this work. The new method is a hybrid of a Markovian update for eligibility traces and a non-Markovian update for opposition traces. The experimental results show an improvement in learning speed for the proposed technique compared to Q(λ) and OQ(λ). The new technique is faster than the OQ(λ) algorithm with the same success rate, and it can be employed in a broader range of applications since it does not require the state transitions to be known.
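The abstract does not spell out the update rules, so the Python sketch below is only a rough illustration of the opposition-trace idea on top of Watkins' Q(λ), not the authors' exact NOQ(λ) algorithm. The environment interface (n_states, n_actions, reset(), step()) and the opposite_action map are hypothetical assumptions; the point it illustrates is that the opposition trace is updated without ever asking the environment for the opposite transition.

import numpy as np

# Rough illustrative sketch (not the authors' exact NOQ(lambda) update
# rules). Assumptions: a discrete environment `env` exposing n_states,
# n_actions, reset() and step(a) -> (next_state, reward, done), plus a
# user-supplied `opposite_action` map (e.g. up<->down, left<->right in a
# gridworld). Eligibility traces follow Watkins' Q(lambda); the opposition
# trace is kept separately and updated without querying the environment
# for the transition of the opposite action (the "non-Markovian" part).
def noq_lambda_sketch(env, opposite_action, episodes=100,
                      alpha=0.1, gamma=0.95, lam=0.9, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)       # eligibility traces (Markovian update)
        e_opp = np.zeros_like(Q)   # opposition traces (non-Markovian update)
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < eps:
                a = np.random.randint(env.n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)

            # TD error of the taken action (standard Watkins' Q(lambda))
            delta = r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a]
            e[s, a] = 1.0                        # replacing trace
            e_opp[s, opposite_action[a]] = 1.0   # mark the opposite action

            Q += alpha * delta * e               # Markovian trace update
            # Opposition update: discourage the opposite action with the
            # negated TD error; no opposite state/transition is needed.
            Q += alpha * (-delta) * e_opp

            greedy = a == int(np.argmax(Q[s]))
            e *= gamma * lam if greedy else 0.0  # Watkins' trace cut-off
            e_opp *= gamma * lam
            s = s_next
    return Q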
Similar Resources
Non-Markovian Agent Evolution with EVOLP
Logic Programming Update Languages were proposed as an extension of logic programming which allows for modelling the dynamics of knowledge bases, where both extensional knowledge (facts) and intensional knowledge (rules) may change over time due to updates, with important applications in Multi-Agent Systems (MAS). Despite their generality, these languages do not provide means to directly acce...
HQ-learning: Discovering Markovian Subgoals for Non-Markovian Reinforcement Learning
To solve partially observable Markov decision problems, we introduce HQ-learning, a hierarchical extension of Q-learning. HQ-learning is based on an ordered sequence of subagents, each learning to identify and solve a Markovian subtask of the total task. Each agent learns (1) an appropriate subgoal (though there is no intermediate, external reinforcement for "good" subgoals), and (2) a Markovia...
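As a compact structural illustration of the scheme this abstract describes, the sketch below shows an ordered sequence of subagents, each choosing a subgoal observation and running plain Q-learning inside its subtask. The environment interface and the subgoal bookkeeping are simplifying assumptions, not the paper's exact HQ-learning update rules.

import numpy as np

# Structural sketch of the HQ-learning idea: a fixed, ordered sequence of
# subagents; each picks a subgoal (an observation to reach) from its HQ
# values and learns an ordinary Q-table for its Markovian subtask.
# Assumed env interface: n_states, n_actions, reset(),
# step(a) -> (next_obs, reward, done).
class SubAgent:
    def __init__(self, n_states, n_actions):
        self.Q = np.zeros((n_states, n_actions))  # subtask policy
        self.HQ = np.zeros(n_states)              # estimated value of each subgoal

    def pick_subgoal(self, eps=0.1):
        if np.random.rand() < eps:
            return np.random.randint(len(self.HQ))
        return int(np.argmax(self.HQ))

def run_episode(env, agents, alpha=0.1, gamma=0.95, eps=0.1):
    # Control passes along the agent sequence: each subagent acts until
    # its chosen subgoal observation is reached, then hands over.
    obs = env.reset()
    done = False
    for agent in agents:
        subgoal = agent.pick_subgoal(eps)
        while not done and obs != subgoal:
            a = (np.random.randint(env.n_actions) if np.random.rand() < eps
                 else int(np.argmax(agent.Q[obs])))
            nxt, r, done = env.step(a)
            # plain Q-learning inside the (Markovian) subtask
            agent.Q[obs, a] += alpha * (r + gamma * np.max(agent.Q[nxt]) * (not done)
                                        - agent.Q[obs, a])
            obs = nxt
        if done:
            break
    return obs, done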
Two Novel On-policy Reinforcement Learning Algorithms based on TD(λ)-methods
This paper describes two novel on-policy reinforcement learning algorithms, named QV(λ)-learning and the actor critic learning automaton (ACLA). Both algorithms learn a state value-function using TD(λ)-methods. The difference between the algorithms is that QV-learning uses the learned value function and a form of Q-learning to learn Q-values, whereas ACLA uses the value function and a learning ...
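To make the QV-learning scheme sketched in this abstract concrete, here is a minimal tabular version in Python: a state value function V is learned with TD(λ), and the Q-values bootstrap from r + γV(s') rather than from a max over Q. The environment interface is an assumption, and the paper's exact trace handling may differ.

import numpy as np

# Minimal sketch of the QV(lambda)-learning idea. Assumed env interface:
# n_states, n_actions, reset(), step(a) -> (next_state, reward, done).
def qv_lambda_sketch(env, episodes=100, alpha=0.1, beta=0.1,
                     gamma=0.95, lam=0.9, eps=0.1):
    V = np.zeros(env.n_states)
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        e = np.zeros(env.n_states)      # eligibility traces for V
        s = env.reset()
        done = False
        while not done:
            a = (np.random.randint(env.n_actions) if np.random.rand() < eps
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)

            # TD(lambda) update of the state value function
            delta_v = r + gamma * V[s_next] * (not done) - V[s]
            e[s] = 1.0                  # replacing trace
            V += alpha * delta_v * e
            e *= gamma * lam

            # Q update that bootstraps from V instead of max Q
            Q[s, a] += beta * (r + gamma * V[s_next] * (not done) - Q[s, a])
            s = s_next
    return Q, V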
Maximum Throughput
Appendix: This appendix contains some hints on the calculation of the performance values mentioned in section 3.2. The probability that submitted queries (updates) can be executed at the local database is denoted as l_q (l_u respectively) and results in l_q = loc + (1 − loc) · r · q_r and l_u = loc, because loc expresses the preference of accessing original local data and the second term reflects...
Trajectory-Based Modified Policy Iteration
This paper presents a new problem solving approach that is able to generate optimal policy solution for finite-state stochastic sequential decision-making problems with high data efficiency. The proposed algorithm iteratively builds and improves an approximate Markov Decision Process (MDP) model along with cost-to-go value approximates by generating finite length trajectories through the state-...
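The loop this abstract describes (collect trajectories, refine an approximate MDP model, re-solve it) can be illustrated with the generic sketch below. It is not the paper's specific algorithm; the environment interface, the cost convention, and the fixed backup count are assumptions made for the example.

import numpy as np

# Generic sketch of trajectory-based model building and policy improvement:
# sample finite-length trajectories, update an empirical MDP (transition
# counts and average costs), then back up values on the learned model.
# Assumed env interface: n_states, n_actions, reset(),
# step(a) -> (next_state, cost, done).
def trajectory_based_mpi_sketch(env, iters=20, trajs_per_iter=10,
                                horizon=50, gamma=0.95, eps=0.2):
    nS, nA = env.n_states, env.n_actions
    counts = np.zeros((nS, nA, nS))
    cost_sum = np.zeros((nS, nA))
    V = np.zeros(nS)
    policy = np.zeros(nS, dtype=int)
    for _ in range(iters):
        # 1) collect trajectories with the current (eps-greedy) policy
        for _ in range(trajs_per_iter):
            s = env.reset()
            for _ in range(horizon):
                a = (np.random.randint(nA) if np.random.rand() < eps
                     else policy[s])
                s2, c, done = env.step(a)
                counts[s, a, s2] += 1
                cost_sum[s, a] += c
                s = s2
                if done:
                    break
        # 2) build the empirical model from the counts collected so far
        n_sa = counts.sum(axis=2, keepdims=True)
        P = np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / nS)
        C = cost_sum / np.maximum(n_sa[:, :, 0], 1)
        # 3) a few value backups plus greedy improvement on the model
        for _ in range(50):
            Qm = C + gamma * P @ V          # shape (nS, nA)
            V = Qm.min(axis=1)              # costs are minimised
            policy = Qm.argmin(axis=1)
    return policy, V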