A Lemma on the Multiarmed Bandit Problem
نویسنده
چکیده
We prove a lemma on the optimal value function for the mdtiarmed bandit problem which provides a simple direct proof of optimality of writeoff policies. This, in turn, leads to a new proof of optimality of the index rule.
منابع مشابه
PAC-Bayesian Analysis of Martingales and Multiarmed Bandits
We present two alternative ways to apply PAC-Bayesian analysis to sequences of dependent random variables. The first is based on a new lemma that enables to bound expectations of convex functions of certain dependent random variables by expectations of the same functions of independent Bernoulli random variables. This lemma provides an alternative tool to Hoeffding-Azuma inequality to bound con...
متن کاملOptimal Sequential Exploration: Bandits, Clairvoyants, and Wildcats
This paper was motivated by the problem of developing an optimal strategy for exploring a large oil and gas field in the North Sea. Where should we drill first? Where do we drill next? The problem resembles a classical multiarmed bandit problem, but probabilistic dependence plays a key role: outcomes at drilled sites reveal information about neighboring targets. Good exploration strategies will...
متن کاملOn the Optimal Reward Function of the Continuous Time Multiarmed Bandit Problem
The optimal reward function associated with the so-called "multiarmed bandit problem" for general Markov-Feller processes is considered. It is shown that this optimal reward function has a simple expression (product form) in terms of individual stopping problems, without any smoothness properties of the optimal reward function neither for the global problem nor for the individual stopping probl...
متن کاملAsymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching - Automatic Control, IEEE Transactions on
We consider multiarmed bandit problems with switching cost, define uniformly good allocation rules, and restrict attention to such rules. We present a lower bound on the asymptotic performance of uniformly good allocation rules and construct an allocation scheme that achieves the bound. We discover that despite the inclusion of a switching cost the proposed allocation scheme achieves the same a...
متن کاملFinite-Time Regret Bounds for the Multiarmed Bandit Problem
We show finite-time regret bounds for the multiarmed bandit problem under the assumption that all rewards come from a bounded and fixed range. Our regret bounds after any number T of pulls are of the form a+b logT+c log2 T , where a, b, and c are positive constants not depending on T . These bounds are shown to hold for variants of the popular "-greedy and Boltzmann allocation rules, and for a ...
متن کامل