Learning Adversarial Markov Decision Processes with Delayed Feedback
Authors
Abstract
Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $(K + D)^{1/2}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_k d^k$ is the total delay. Under bandit feedback, we prove similar $(K + D)^{1/2}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
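The delayed-feedback protocol in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it only shows when the feedback of each episode becomes visible to the learner and how the total delay adds up. The function names `feedback_schedule` and `total_delay`, the example delay sequence, and the choice to drop feedback that would arrive after the last episode are all assumptions made for illustration.

```python
from collections import defaultdict


def feedback_schedule(delays):
    """Return, for each episode t, the episodes whose feedback arrives at the end of t.

    delays[k - 1] is the delay d^k of episode k (episodes are 1-indexed).
    Assumption: feedback due after the final episode K is never observed.
    """
    K = len(delays)
    arrivals = defaultdict(list)
    for k, d in enumerate(delays, start=1):
        if d < 0:
            raise ValueError("delays must be non-negative")
        if k + d <= K:
            # Feedback of episode k is revealed at the end of episode k + d^k.
            arrivals[k + d].append(k)
    return {t: arrivals.get(t, []) for t in range(1, K + 1)}


def total_delay(delays):
    """Total delay D, taken here as the plain sum of the per-episode delays."""
    return sum(delays)


# Hypothetical delay sequence; in the paper's setting it is fixed by an oblivious adversary.
delays = [0, 2, 0, 5, 1]
schedule = feedback_schedule(delays)
for t, episodes in schedule.items():
    print(f"end of episode {t}: feedback for episodes {episodes}")
print("K =", len(delays), "D =", total_delay(delays))
```

With these delays, episodes 4 and 5 never deliver feedback within the horizon, which is the kind of situation an unbounded delay sequence allows.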
Similar resources
Online model learning in adversarial Markov decision processes
Consider, for example, the well-known game of Roshambo (Figure 1), or rock-paper-scissors, in which two players select one of three actions simultaneously. One may know that the adversary will base its next action on some bounded sequence of the past joint actions, but may be unaware of its exact strategy. For example, one may notice that every time it selects P, the adversary selects S in the...
Learning Qualitative Markov Decision Processes
To navigate in natural environments, a robot must decide the best action to take according to its current situation and goal, a problem that can be represented as a Markov Decision Process (MDP). In general, it is assumed that a reasonable state representation and transition model can be provided by the user to the system. When dealing with complex domains, however, it is not always easy or pos...
Learning Markov Decision Processes for Model Checking
Constructing an accurate system model for formal model verification can be both resource demanding and time-consuming. To alleviate this shortcoming, algorithms have been proposed for automatically learning system models based on observed system behaviors. In this paper we extend the algorithm on learning probabilistic automata to reactive systems, where the observed system behavior is in the f...
Principled Option Learning in Markov Decision Processes
It is well known that options can make planning more efficient, among their many benefits. Thus far, algorithms for autonomously discovering a set of useful options were heuristic. Naturally, a principled way of finding a set of useful options may be more promising and insightful. In this paper we suggest a mathematical characterization of good sets of options using tools from information theor...
Reinforcement Learning and Markov Decision Processes
Situated in between supervised learning and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision making problems in which there is limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. Fir...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2022
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v36i7.20690