Constrained regret minimization for multi-criterion multi-armed bandits

Authors

Abstract

We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a 'primary' attribute subject to user-provided constraints on the other 'secondary' attributes. We assume that the attributes can be estimated using samples from the arms' distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a boolean flag that correctly identifies, with high probability, whether the given instance is feasible or infeasible with respect to the constraints. We show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk-constrained maximization of expected return is meaningful.
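To make the setting concrete, here is a minimal sketch of a constrained-bandit strategy in the spirit of the abstract: arms are filtered by a lower confidence bound (LCB) on their secondary attribute, and a UCB rule on the primary attribute picks among the plausibly feasible arms. All names, the confidence radius, and the specific index rules are illustrative assumptions on my part, not the paper's exact Con-LCB algorithm.

```python
import math
import random

def constrained_bandit(arms, horizon, threshold):
    """Two-attribute bandit sketch: maximise the primary mean subject to
    the secondary mean being at most `threshold`.

    Each entry of `arms` is a callable returning a (primary, secondary)
    sample, with both attributes assumed bounded in [0, 1].
    """
    k = len(arms)
    counts = [0] * k
    prim = [0.0] * k  # running sums of the primary attribute
    sec = [0.0] * k   # running sums of the secondary attribute

    def pull(i):
        p, s = arms[i]()
        counts[i] += 1
        prim[i] += p
        sec[i] += s

    def radius(i, t):
        # Hoeffding-style confidence radius for [0, 1]-bounded attributes.
        return math.sqrt(2.0 * math.log(max(t, 2)) / counts[i])

    for i in range(k):          # play each arm once to initialise
        pull(i)

    for t in range(k, horizon):
        # An arm stays plausibly feasible while the LCB of its secondary
        # attribute is still at or below the threshold.
        feas = [i for i in range(k)
                if sec[i] / counts[i] - radius(i, t) <= threshold]
        if not feas:            # every arm looks infeasible; fall back to all
            feas = list(range(k))
        # Among plausibly feasible arms, play the largest primary-attribute UCB.
        pull(max(feas, key=lambda i: prim[i] / counts[i] + radius(i, t)))

    # Feasibility flag: some arm's secondary-attribute UCB is below threshold.
    flag = any(sec[i] / counts[i] + radius(i, horizon) <= threshold
               for i in range(k))
    best = max(range(k), key=lambda i: counts[i])
    return best, flag, counts

# Demo with two hypothetical arms: a risky low-return arm and a safe
# high-return arm, with a risk cap of 0.4 on the secondary attribute.
random.seed(0)
arms = [
    lambda: (0.5 * random.random(), random.random()),           # risky, low return
    lambda: (0.4 + 0.2 * random.random(), 0.3 * random.random()),  # safe, high return
]
best, feasible, counts = constrained_bandit(arms, 2000, threshold=0.4)
```

The portfolio-optimization application mentioned in the abstract maps onto this sketch directly: the primary attribute is the return of an asset, the secondary attribute a risk measure, and the threshold the investor's risk budget.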

Similar articles

Bounded regret in stochastic multi-armed bandits

We study the stochastic multi-armed bandit problem when one knows the value μ(⋆) of an optimal arm, as well as a positive lower bound on the smallest positive gap ∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows ∆, and bound...

Contextual Multi-Armed Bandits

We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions ...

Staged Multi-armed Bandits

In conventional multi-armed bandits (MAB) and other reinforcement learning methods, the learner sequentially chooses actions and obtains a reward (which can be possibly missing, delayed or erroneous) after each taken action. This reward is then used by the learner to improve its future decisions. However, in numerous applications, ranging from personalized patient treatment to personalized web-...

Mortal Multi-Armed Bandits

We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have a (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified ...

Regional Multi-Armed Bandits

We consider a variant of the classic multi-armed bandit problem where the expected reward of each arm is a function of an unknown parameter. The arms are divided into different groups, each of which has a common parameter. Therefore, when the player selects an arm at each time slot, information about other arms in the same group is also revealed. This regional bandit model naturally bridges the non...

Journal

Journal title: Machine Learning

Year: 2023

ISSN: 0885-6125, 1573-0565

DOI: https://doi.org/10.1007/s10994-022-06291-9