Planning in Reward-Rich Domains via PAC Bandits

Authors

  • Sergiu Goschin
  • Ari Weinstein
  • Michael L. Littman
  • Erick Chastain
Abstract

In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations or “pulls” needed to identify a solution whose evaluation exceeds a given threshold r0. We present several algorithms and use them to identify reliable strategies for solving screens from the video games Infinite Mario and Pitfall! We show order of magnitude improvements in sample complexity over a natural approach that pulls each arm until a good estimate of its success probability is known.
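The natural baseline that the abstract compares against (pull each sampled arm until its success probability is well estimated) can be sketched in a few lines. The Python below is an illustration only, not the authors' algorithm: the names `natural_baseline` and `draw_arm`, the Hoeffding-style pull budget, and the uniform toy domain are all assumptions.

    import math
    import random


    def natural_baseline(draw_arm, r0, epsilon, delta, max_arms=10_000):
        """Sketch of the naive baseline: estimate each sampled arm's success
        probability to accuracy epsilon (with confidence 1 - delta), and
        return the first arm whose estimate reaches the threshold r0."""
        # Hoeffding bound: this many pulls make |p_hat - p| <= epsilon
        # with probability at least 1 - delta for a single arm.
        pulls_per_arm = math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))
        for _ in range(max_arms):
            arm = draw_arm()                       # sample a fresh arm
            p_hat = sum(arm() for _ in range(pulls_per_arm)) / pulls_per_arm
            if p_hat >= r0:                        # w.h.p. the true rate is >= r0 - epsilon
                return arm, p_hat
        return None, None


    if __name__ == "__main__":
        # Toy "reward-rich" domain: a freshly sampled arm succeeds with a
        # probability drawn uniformly from [0, 1], so good arms are common.
        def draw_arm():
            p = random.random()
            return lambda: 1 if random.random() < p else 0

        arm, estimate = natural_baseline(draw_arm, r0=0.8, epsilon=0.05, delta=0.05)
        print("estimated success probability:", estimate)

The paper's contribution is algorithms whose sample complexity improves on this kind of baseline by an order of magnitude.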


Related articles

PAC Bandits with Risk Constraints

We study the problem of best arm identification with risk constraints within the setting of fixed confidence pure exploration bandits (PAC bandits). The goal is to stop as fast as possible, and with high confidence return an arm whose mean is ε-close to the best arm among those that satisfy a risk constraint, namely that their α-quantile functions are larger than a threshold β. For this risk-sensitiv...

Full text
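As one concrete reading of the selection rule in the abstract above, the sketch below filters arms by an empirical α-quantile constraint and then returns the feasible arm with the largest empirical mean. The function name `risk_constrained_best` and the sample data are hypothetical; the paper's setting samples adaptively and comes with (ε, δ) guarantees that this fixed-batch illustration does not provide.

    import numpy as np


    def risk_constrained_best(samples_by_arm, alpha, beta):
        """Keep only arms whose empirical alpha-quantile is at least beta,
        then return the feasible arm with the largest empirical mean."""
        feasible = {
            arm: rewards
            for arm, rewards in samples_by_arm.items()
            if np.quantile(rewards, alpha) >= beta
        }
        if not feasible:
            return None
        return max(feasible, key=lambda arm: np.mean(feasible[arm]))


    # Toy data: the risky arm has the larger mean but violates the
    # 0.1-quantile constraint, so the safer arm is returned.
    samples = {
        "safe_arm":  [0.5, 0.6, 0.55, 0.6, 0.5],
        "risky_arm": [1.0, 1.0, 0.0, 1.0, 0.0],
    }
    print(risk_constrained_best(samples, alpha=0.1, beta=0.3))  # -> safe_arm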

Skyline Identification in Multi-Armed Bandits

We introduce a variant of the classical PAC multi-armed bandit problem. There is an ordered set of n arms A[1], ..., A[n], each with some stochastic reward drawn from some unknown bounded distribution. The goal is to identify the skyline of the set A, consisting of all arms A[i] such that A[i] has larger expected reward than all lower-numbered arms A[1], ..., A[i−1]. We define a natural ...

Full text
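The skyline definition above is easy to state in code. The sketch below assumes the expected rewards are known exactly (the function name `skyline` and the example values are hypothetical); the paper's actual problem is identifying this set from noisy samples.

    def skyline(means):
        """Indices i such that means[i] is strictly larger than every
        earlier value means[0..i-1] (vacuously, index 0 always qualifies)."""
        result, best_so_far = [], float("-inf")
        for i, m in enumerate(means):
            if m > best_so_far:
                result.append(i)
                best_so_far = m
        return result


    # Example: arms 0, 2, and 4 form the skyline.
    print(skyline([0.3, 0.1, 0.5, 0.4, 0.9]))  # -> [0, 2, 4]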

Multi-Armed Bandits, Gittins Index, and its Calculation

Multi-armed bandit is a colorful term that refers to the dilemma faced by a gambler playing in a casino with multiple slot machines (which were colloquially called one-armed bandits). What strategy should a gambler use to pick the machine to play next? Should it be the one for which the posterior mean of winning is the highest, thereby maximizing current expected reward, or the one for which the ...

Full text

Modal Bandits

Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.

Full text
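To illustrate how a modal criterion can differ from the usual mean criterion, the toy sketch below picks the arm with the largest empirical mode. It assumes discrete rewards and hypothetical sample data, and is not the paper's estimator.

    from collections import Counter


    def empirical_mode(samples):
        """Most frequent observed reward: a plug-in estimate of the mode
        (assumes a discrete reward support)."""
        return Counter(samples).most_common(1)[0][0]


    # arm_1 has the smaller mean (3.0 vs 3.6) but the larger mode (5 vs 4),
    # so a modal criterion prefers it while a mean criterion would not.
    arm_samples = {
        "arm_1": [5, 5, 5, 0, 0],
        "arm_2": [4, 4, 3, 4, 3],
    }
    print(max(arm_samples, key=lambda a: empirical_mode(arm_samples[a])))  # -> arm_1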


Publication date: 2012