Improving the Pareto UCB1 Algorithm on the Multi-Objective Multi-Armed Bandit
Authors
{audrey.durand.2, charles.bordet.1}@ulaval.ca, [email protected]
Abstract
In this work, we introduce a straightforward approach for bounding the regret of Multi-Objective Multi-Armed Bandit (MO-MAB) heuristics extended from standard bandit algorithms. The proposed methodology allows us to easily build upon the regret analysis of these heuristics in the standard bandit setting. Using our approach, we improve the Pareto UCB1 algorithm, which is the multi-objective extension of the seminal UCB1, by performing a tighter regret analysis. The resulting Pareto UCB1* also has the advantage of being empirically usable without any approximation.

1 Multi-Objective Multi-Armed Bandit

The Multi-Objective Multi-Armed Bandit (MO-MAB) setting [1] is described by a set of arms K, where each arm k ∈ K is associated with a set of random reward vectors {x_{k,t} | t ≥ 1}. Let N be the number of objectives. The vector x_{k,t} = [x_{k,t,1}, …, x_{k,t,N}] denotes the random outcome of the k-th arm on its t-th trial, where x_{k,t,i} ∈ R. We consider the stochastic setting where all x_{k,t} associated with arm k are independent and identically distributed according to some unknown distribution with unknown expectation vector μ_k = [μ_{k,1}, …, μ_{k,N}].

Given two arms a and b, a is said to dominate, or Pareto-dominate, b (denoted a ⪰ b) if μ_{a,i} ≥ μ_{b,i} for every objective i. The dominance is strict (denoted a ≻ b) if μ_{a,i} > μ_{b,i} for every objective i. Finally, the two arms are incomparable (denoted a ‖ b) if neither a ⪰ b nor b ⪰ a. The set of optimal arms contains all the non-dominated arms, that is K* = {k ∈ K | ∄ k′ ∈ K : μ_{k′} ≻ μ_k}. The Pareto front P, also referred to as the Pareto-optimal set, contains the expectations of the optimal arms, that is P = {μ_{k*} | k* ∈ K*}. In this work, we consider the setting where all optimal arms are regarded as equivalent, that is, we are not biased toward playing any arm of K* more than the others.

The problem can be formulated as a game where a player sequentially selects arms in K and observes rewards according to the played arms. Let k(t) denote the arm played at episode t and let r(t) = x_{k(t),t} be the corresponding reward. The goal is to simultaneously maximize the reward over time for all objectives; therefore, we want to play optimal arms of K* as often as possible. Let n_k(t) denote the number of times arm k has been played up to time t − 1. The performance is measured by the expected regret

E[R(T)] = ∑_{k ∈ K} E[n_k(T)] ∆_k,    (1)

where T is the number of episodes performed so far and ∆_k corresponds to the regret of playing arm k instead of an optimal arm (in K*).

A typical approach for adapting standard bandit heuristics to the MO-MAB setting relies on the concept of Pareto-dominance. Instead of playing the arm that maximizes the expected reward, one …
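To make the definitions above concrete, the following Python sketch implements Pareto dominance, the optimal arm set K*, the expected regret of Eq. (1), and the generic rule of playing uniformly among arms whose optimistic index vector is non-dominated. It is an illustration only: the per-arm gaps ∆_k are taken as given inputs, and the sqrt(2 ln t / n_k) exploration bonus is the standard single-objective UCB1 bonus, not the exact index derived for Pareto UCB1 or Pareto UCB1* in the paper.

```python
# A minimal sketch of the notions defined above: Pareto dominance, the
# optimal arm set K*, the expected regret of Eq. (1), and the generic
# "play uniformly among non-dominated arms" selection rule. The gap
# values Delta_k and the sqrt(2 ln t / n_k) bonus are illustrative
# assumptions, not the exact quantities of the Pareto UCB1* analysis.
import math
import random
from typing import List, Sequence


def dominates(mu_a: Sequence[float], mu_b: Sequence[float]) -> bool:
    """a Pareto-dominates b: at least as good in every objective."""
    return all(a_i >= b_i for a_i, b_i in zip(mu_a, mu_b))


def strictly_dominates(mu_a: Sequence[float], mu_b: Sequence[float]) -> bool:
    """a strictly dominates b: strictly better in every objective."""
    return all(a_i > b_i for a_i, b_i in zip(mu_a, mu_b))


def incomparable(mu_a: Sequence[float], mu_b: Sequence[float]) -> bool:
    """Neither vector dominates the other."""
    return not dominates(mu_a, mu_b) and not dominates(mu_b, mu_a)


def pareto_optimal_arms(means: List[Sequence[float]]) -> List[int]:
    """K* = {k | no other arm strictly dominates arm k}."""
    return [
        k for k, mu_k in enumerate(means)
        if not any(strictly_dominates(mu_kp, mu_k)
                   for kp, mu_kp in enumerate(means) if kp != k)
    ]


def expected_regret(counts: Sequence[int], gaps: Sequence[float]) -> float:
    """Eq. (1): sum over arms of n_k(T) * Delta_k, with Delta_k given."""
    return sum(n_k * delta_k for n_k, delta_k in zip(counts, gaps))


def select_arm(emp_means: List[List[float]], counts: List[int], t: int) -> int:
    """Build an optimistic index vector per arm, then pick uniformly at
    random among the arms whose index vector is non-dominated."""
    indices = []
    for k, mu_hat in enumerate(emp_means):
        if counts[k] == 0:
            return k  # play each arm once before using the indices
        bonus = math.sqrt(2.0 * math.log(t) / counts[k])  # assumed UCB1-style bonus
        indices.append([m + bonus for m in mu_hat])
    return random.choice(pareto_optimal_arms(indices))
```

The uniform random choice among non-dominated index vectors mirrors the assumption above that all optimal arms are treated as equivalent.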
Similar articles
Knowledge Gradient for Multi-objective Multi-armed Bandit Algorithms
We extend knowledge gradient (KG) policy for the multi-objective, multi-armed bandits problem to efficiently explore the Pareto optimal arms. We consider two partial order relationships to order the mean vectors, i.e. Pareto and scalarized functions. Pareto KG finds the optimal arms using Pareto search, while the scalarizations-KG transform the multi-objective arms into one-objective arm to fin...
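As a side illustration of the scalarized ordering mentioned in this abstract, a linear scalarization collapses each mean vector into a single value through a weight vector, after which any single-objective bandit rule applies. The sketch below is a generic example, not the knowledge-gradient policy of the cited work; the weight vector w is an arbitrary choice.

```python
# Generic linear scalarization of multi-objective mean vectors: a weight
# vector w turns each N-dimensional mean into one scalar objective.
from typing import Sequence


def linear_scalarization(mu: Sequence[float], w: Sequence[float]) -> float:
    """Weighted sum w . mu collapsing N objectives into a single one."""
    return sum(w_i * mu_i for w_i, mu_i in zip(w, mu))


# Two Pareto-incomparable arms can receive the same scalar value:
print(linear_scalarization((1.0, 0.0), (0.5, 0.5)))  # 0.5
print(linear_scalarization((0.0, 1.0), (0.5, 0.5)))  # 0.5
```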
Multi-Objective Reinforcement Learning
In multi-objective reinforcement learning (MORL) the agent is provided with multiple feedback signals when performing an action. These signals can be independent, complementary or conflicting. Hence, MORL is the process of learning policies that optimize multiple criteria simultaneously. In this abstract, we briefly describe our extensions to single-objective multi-armed bandits and reinforceme...
Multi-armed bandit problem with known trend
We consider a variant of the multi-armed bandit model, which we call multi-armed bandit problem with known trend, where the gambler knows the shape of the reward function of each arm but not its distribution. This new problem is motivated by different on-line problems like active learning, music and interface recommendation applications, where when an arm is sampled by the model the received re...
Multi-Armed Bandit for Pricing
This paper is about the study of Multi–Armed Bandit (MAB) approaches for pricing applications, where a seller needs to identify the selling price for a particular kind of item that maximizes her/his profit without knowing the buyer demand. We propose modifications to the popular Upper Confidence Bound (UCB) bandit algorithm exploiting two peculiarities of pricing applications: 1) as the selling...
Pareto Adaptive Decomposition algorithm
Dealing with multi-objective combinatorial optimization and local search, this article proposes a new multi-objective meta-heuristic named Pareto Adaptive Decomposition algorithm (PAD). Combining ideas from decomposition methods, two phase algorithms and multi-armed bandit, PAD provides a 2-phase modular framework for finding an approximation of the Pareto front. The first phase decomposes the ...