Preference-based Reinforcement Learning
Abstract
This paper investigates the problem of policy search based only on the expert's preferences. Whereas reinforcement learning classically relies on a reward function, or exploits the expert's demonstrations, preference-based policy learning (PPL) iteratively builds and optimizes a policy return estimate as follows: the learning agent demonstrates a few policies, is informed of the expert's preferences about the demonstrated policies, constructs a utility function compatible with all expert preferences, uses it in a self-training phase, and demonstrates in the next iteration the policy maximizing the current utility function. PPL thus tackles an expensive optimization problem, as each policy assessment relies on the expert's feedback; the expert's preference model is used as a surrogate model of the true objective function. Compared to earlier work on active preference learning (see e.g. Brochu et al., 2008), a main lesson is that the surrogate model is preferably learned on the state × action space, as opposed to the policy parameter space. Empirical evidence from the cancer treatment policy domain is provided to support and discuss this claim.
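A minimal sketch may help make the PPL loop concrete. The Python code below is an illustration under stated assumptions, not the authors' implementation: the callbacks `demonstrate`, `ask_expert`, and `optimize_policy` are hypothetical, trajectories are summarized by averaged state-action feature vectors, and the utility is a linear function fitted from pairwise preferences with a perceptron-style ranking update, consistent with the abstract's claim that the surrogate is learned on the state × action space.

```python
import numpy as np

# Hypothetical sketch of the PPL loop described in the abstract.
# Assumptions (not from the paper): each trajectory is a list of
# (state, action) array pairs, summarized by its mean feature vector,
# and the utility is linear in those features.

def trajectory_features(trajectory):
    """Average state-action feature vector of a trajectory."""
    return np.mean([np.concatenate([s, a]) for s, a in trajectory], axis=0)

def fit_utility(preferences, dim, epochs=100, lr=0.1):
    """Fit a linear utility w so that w . (phi_better - phi_worse) > 0
    for every expert preference (better, worse)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for phi_better, phi_worse in preferences:
            if w @ (phi_better - phi_worse) <= 0:  # violated preference
                w += lr * (phi_better - phi_worse)
    return w

def ppl_loop(demonstrate, ask_expert, optimize_policy, n_iters=10):
    """One possible reading of the PPL iteration:
    demonstrate -> collect a preference -> refit utility -> optimize."""
    preferences = []
    policy, best_phi = None, None
    for _ in range(n_iters):
        trajectory = demonstrate(policy)        # agent demonstrates a policy
        phi = trajectory_features(trajectory)
        if best_phi is None:
            best_phi = phi                      # first demonstration: no comparison yet
        elif ask_expert(phi, best_phi):         # expert prefers the new demonstration
            preferences.append((phi, best_phi))
            best_phi = phi
        else:                                   # expert prefers the incumbent
            preferences.append((best_phi, phi))
        w = fit_utility(preferences, dim=len(phi))
        # self-training phase: optimize a policy against the current
        # utility estimate, defined on the state x action space
        policy = optimize_policy(lambda s, a: w @ np.concatenate([s, a]))
    return policy
```

Note that the learned utility scores state-action pairs rather than policy parameter vectors, which is the design choice the abstract argues for; the inner `optimize_policy` step stands in for an ordinary RL run against the surrogate.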
Similar Papers
Preference-Based Reinforcement Learning: A preliminary survey
Preference-based reinforcement learning has gained significant popularity over the years, but it is still unclear what exactly preference learning is and how it relates to other reinforcement learning tasks. In this paper, we present a general definition of preferences, as well as some insight into how these approaches compare to reinforcement learning, inverse reinforcement learning and other relate...
Preference-Based Policy Iteration: Leveraging Preference Learning for Reinforcement Learning
This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a "preference-based" approach to reinforcement learning is a possible extension of the type of feedback an agent may learn from. In particular, while conventional RL methods are essentially confined to dealing with numeri...
Towards Preference-Based Reinforcement Learning
This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a preference-based approach to reinforcement learning is the observation that in many real-world domains, numerical feedback signals are not readily available, or are defined arbitrarily in order to satisfy the needs o...
APRIL: Active Preference Learning-Based Reinforcement Learning
This paper focuses on reinforcement learning (RL) with limited prior knowledge. In the domain of swarm robotics, for instance, the expert can hardly design a reward function or demonstrate the target behavior, forbidding the use of both standard RL and inverse reinforcement learning. Although with limited expertise, the human expert is often still able to emit preferences and rank the agent de...
Preference-learning based Inverse Reinforcement Learning for Dialog Control
Dialog systems that realize dialog control with reinforcement learning have recently been proposed. However, reinforcement learning has an open problem: it requires a reward function, which is difficult to set appropriately. To set the appropriate reward function automatically, we propose preference-learning based inverse reinforcement learning (PIRL), which estimates a reward function from dia...