Learning the distribution with largest mean: two bandit frameworks

Authors

  • Emilie Kaufmann
  • Aurélien Garivier
Abstract

Over the past few years, the multi-armed bandit model has become increasingly popular in the machine learning community, partly because of applications including online content optimization. This paper reviews two different sequential learning tasks that have been considered in the bandit literature; they can be formulated as (sequentially) learning which distribution has the highest mean among a set of distributions, with some constraints on the learning process. For both of them (regret minimization and best arm identification), we present recent, asymptotically optimal algorithms. We compare the behavior of the sampling rule of each algorithm as well as the complexity terms associated with each problem.
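To make the two tasks concrete, here is a minimal Python sketch of a stochastic bandit environment with Bernoulli arms, together with the pseudo-regret that the first task minimizes. The class and all names are illustrative assumptions, not constructs from the paper.

    import random

    class BernoulliBandit:
        """Toy K-armed environment with Bernoulli rewards (illustrative only)."""

        def __init__(self, means):
            self.means = means                 # unknown to the learning agent
            self.best_mean = max(means)

        def pull(self, arm):
            # Reward is 1 with probability means[arm], and 0 otherwise.
            return 1 if random.random() < self.means[arm] else 0

        def pseudo_regret(self, counts):
            # After T = sum(counts) pulls: T * mu_star - sum_a N_a * mu_a.
            return sum(counts) * self.best_mean - sum(
                n * m for n, m in zip(counts, self.means))

Regret minimization penalizes every pull of a suboptimal arm during learning, whereas best arm identification only cares about the correctness of the final recommendation; the sketches accompanying the related abstracts below illustrate one strategy for each objective.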

Similar articles

On Finding the Largest Mean Among Many

Sampling from distributions to find the one with the largest mean arises in a broad range of applications, and it can be mathematically modeled as a multi-armed bandit problem in which each distribution is associated with an arm. This paper studies the sample complexity of identifying the best arm (largest mean) in a multi-armed bandit problem. Motivated by large-scale applications, we are espe...
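As a hedged illustration of the fixed-confidence variant of this task, here is classical successive elimination with Hoeffding confidence bounds (a textbook strategy, not necessarily the one analyzed in the paper above), assuming rewards in [0, 1] and a pull(arm) sampler:

    import math

    def successive_elimination(pull, n_arms, delta=0.05):
        """Sample all surviving arms uniformly; drop an arm once its upper
        confidence bound falls below the best lower confidence bound."""
        alive = set(range(n_arms))
        sums = [0.0] * n_arms
        t = 0
        while len(alive) > 1:
            t += 1
            for a in alive:
                sums[a] += pull(a)             # each survivor now has t samples
            # Hoeffding radius with a crude union bound over arms and rounds.
            rad = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
            best_lcb = max(sums[a] / t - rad for a in alive)
            alive = {a for a in alive if sums[a] / t + rad >= best_lcb}
        return alive.pop()

With probability at least 1 - delta this returns the arm with the largest mean, and its sample complexity scales with the inverse squared gaps between the best mean and the others.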

On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models

The stochastic multi-armed bandit model is a simple abstraction that has proven useful in many different contexts in statistics and machine learning. Whereas the achievable limit in terms of regret minimization is now well known, our aim is to contribute to a better understanding of the performance in terms of identifying the m best arms. We introduce generic notions of complexity for the two d...
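The complexity notions in this line of work rest on a change-of-distribution argument. Up to notation, the key inequality (stated here as a sketch; ν′ is any alternative bandit model whose best arm differs from that of ν, τ is the stopping time of a δ-correct strategy, and N_a(τ) is the number of draws of arm a) reads:

    \sum_{a=1}^{K} \mathbb{E}_{\nu}\!\left[N_a(\tau)\right] \,
        \mathrm{KL}(\nu_a, \nu'_a) \;\ge\; \mathrm{kl}(\delta, 1-\delta),
    \qquad
    \mathrm{kl}(x, y) = x \log\frac{x}{y} + (1-x) \log\frac{1-x}{1-y}.

Optimizing over the alternative model ν′ (and over the sampling proportions) is what produces the complexity constants for the fixed-confidence and fixed-budget settings.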

Stochastic Regret Minimization via Thompson Sampling

The Thompson Sampling (TS) policy is a widely implemented algorithm for the stochastic multi-armed bandit (MAB) problem. Given a prior distribution over possible parameter settings of the underlying reward distributions of the arms, at each time instant, the policy plays an arm with probability equal to the probability that this arm has the largest mean reward conditioned on the current posterior di...
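For Bernoulli rewards with Beta priors, posterior sampling implements exactly this rule without ever computing the posterior probability of being best: drawing one mean from each arm's posterior and playing the argmax selects each arm with precisely that probability. A minimal sketch; the pull callback, the uniform Beta(1, 1) priors, and all names are assumptions for illustration:

    import random

    def thompson_sampling(pull, n_arms, horizon):
        """Beta-Bernoulli Thompson Sampling: draw one sample from each arm's
        posterior and play the arm whose sample is largest."""
        alpha = [1] * n_arms                   # Beta(1, 1) uniform priors
        beta = [1] * n_arms
        counts = [0] * n_arms
        for _ in range(horizon):
            samples = [random.betavariate(alpha[a], beta[a])
                       for a in range(n_arms)]
            arm = max(range(n_arms), key=lambda a: samples[a])
            reward = pull(arm)                 # assumed to be 0 or 1
            alpha[arm] += reward
            beta[arm] += 1 - reward
            counts[arm] += 1
        return counts

    # Example with two (hypothetical) Bernoulli arms of means 0.4 and 0.6:
    means = [0.4, 0.6]
    counts = thompson_sampling(lambda a: int(random.random() < means[a]), 2, 10000)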

Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising

In search advertising, the search engine needs to select the most profitable advertisements to display, which can be formulated as an instance of online learning with partial feedback, also known as the stochastic multi-armed bandit (MAB) problem. In this paper, we show that the naive application of MAB algorithms to search advertising for advertisement selection will produce sample selection b...
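A quick Monte Carlo sketch of the underlying selection bias (an illustration of the phenomenon, not the paper's analysis): even when all ads have the same true click-through rate, reporting the empirical rate of the best-looking ad over-estimates it, because the same data are used both to select and to evaluate the ad.

    import random, statistics

    def selection_bias(n_ads=10, pulls=100, mu=0.05, n_runs=2000):
        """Average empirical CTR of the seemingly best ad among n_ads
        identical Bernoulli(mu) ads, each observed for `pulls` impressions."""
        picked = []
        for _ in range(n_runs):
            rates = [sum(random.random() < mu for _ in range(pulls)) / pulls
                     for _ in range(n_ads)]
            picked.append(max(rates))
        return statistics.mean(picked)

    print(selection_bias())    # clearly above the true rate mu = 0.05

Adaptive allocation by a bandit algorithm compounds this effect, since under-explored ads additionally carry downward-biased empirical means.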

UCB Algorithm for Exponential Distributions

We introduce in this paper a new algorithm for Multi-Armed Bandit (MAB) problems, a machine learning paradigm popular within Cognitive Network related topics (e.g., Spectrum Sensing and Allocation). We focus on the case where the rewards are exponentially distributed, which is common when dealing with Rayleigh fading channels. This strategy, named Multiplicative Upper Confidence Bound (MUCB), a...
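The MUCB index itself is defined in the paper; as a hedged stand-in, here is a kl-UCB-style sketch specialized to exponential rewards (a swapped-in index policy, not the MUCB formula), using the closed-form divergence KL(Exp(mean x), Exp(mean y)) = log(y/x) + x/y - 1 and a bisection to invert it:

    import math, random

    def klucb_exponential(pull, n_arms, horizon):
        """kl-UCB index policy for exponentially distributed rewards."""
        def d(x, y):
            # KL divergence between exponentials with means x <= y.
            return math.log(y / x) + x / y - 1.0

        def index(mean, n, t):
            # Largest y >= mean with n * d(mean, y) <= log t, by bisection.
            level = math.log(t) / n
            lo, hi = mean, mean * math.exp(level + 1.0)   # d(mean, hi) > level
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                if d(mean, mid) <= level:
                    lo = mid
                else:
                    hi = mid
            return lo

        sums, counts = [0.0] * n_arms, [0] * n_arms
        for a in range(n_arms):                # initialization: one pull each
            sums[a] += pull(a)
            counts[a] += 1
        for t in range(n_arms + 1, horizon + 1):
            a = max(range(n_arms),
                    key=lambda i: index(sums[i] / counts[i], counts[i], t))
            sums[a] += pull(a)
            counts[a] += 1
        return counts

    # Example: two channels modeled as exponential rewards with (hypothetical)
    # means 1.0 and 1.5; random.expovariate takes the rate 1/mean.
    counts = klucb_exponential(
        lambda a: random.expovariate(1.0 / (1.0, 1.5)[a]), 2, 5000)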

Journal:
  • CoRR

Volume: abs/1702.00001

Publication year: 2017