Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits
Authors
Abstract
We present and prove properties of a new offline policy evaluator for exploration learning settings that is superior to previous evaluators. In particular, it simultaneously and correctly incorporates techniques from importance weighting, doubly robust evaluation, and nonstationary policy evaluation. In addition, our approach generates longer histories by carefully controlling a bias-variance tradeoff, and further decreases variance by incorporating information about the randomness of the target policy. Empirical evidence from synthetic and real-world exploration learning problems shows that the new evaluator successfully unifies previous approaches and uses information an order of magnitude more efficiently.
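The doubly robust idea the abstract builds on can be illustrated with a minimal sketch. This is not the paper's evaluator; it is a generic doubly robust off-policy value estimate, with all function and parameter names (`doubly_robust_value`, `reward_model`, `target_probs`) chosen here for illustration:

```python
import numpy as np

def doubly_robust_value(contexts, actions, rewards,
                        logging_probs, target_probs, reward_model):
    """Doubly robust off-policy value estimate (illustrative sketch).

    contexts[i], actions[i], rewards[i]: the i-th logged interaction.
    logging_probs[i]: probability the logging policy gave to actions[i].
    target_probs[i][b]: probability the target policy assigns to action b.
    reward_model(x, a): an estimate of the expected reward (may be biased).
    """
    n = len(rewards)
    estimates = np.empty(n)
    for i in range(n):
        x, a = contexts[i], actions[i]
        # Direct-method term: model reward averaged over the target policy.
        dm = sum(p * reward_model(x, b) for b, p in enumerate(target_probs[i]))
        # Importance-weighted correction on the logged action removes the
        # model's bias wherever the logging policy covers the target policy.
        w = target_probs[i][a] / logging_probs[i]
        estimates[i] = dm + w * (rewards[i] - reward_model(x, a))
    return float(estimates.mean())
```

If the reward model is exact, the correction term vanishes and the estimator has low variance; if the model is identically zero, the estimator reduces to plain importance weighting, which is why the combination is called doubly robust.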
Similar resources
A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits
We propose a novel evaluation methodology for assessing off-policy learning methods in contextual bandits. In particular, we provide a way to use any given Randomized Controlled Trial (RCT) to generate a range of observational studies (with synthesized “outcome functions”) that can match the user’s specified degrees of sample selection bias, which can then be used to comprehensively assess a given...
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) an...
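The inverse propensity scoring estimator mentioned above can be written in a few lines. This is a minimal sketch of the standard IPS estimator, with illustrative names (`ips_value`, `target_probs`):

```python
import numpy as np

def ips_value(actions, rewards, logging_probs, target_probs):
    """Inverse propensity scoring (IPS) estimate of a target policy's value.

    Each logged reward is reweighted by the ratio of the target policy's
    probability to the logging policy's probability for the action that was
    actually taken; the estimate is unbiased whenever the logging policy
    puts positive probability on every action the target policy can take.
    """
    weights = np.array([target_probs[i][a] / logging_probs[i]
                        for i, a in enumerate(actions)])
    return float(np.mean(weights * np.asarray(rewards)))
```

The variance of this estimator grows with the importance weights, which is the tradeoff the minimax analysis in this abstract quantifies.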
Resourceful Contextual Bandits
We study contextual bandits with ancillary constraints on resources, which are common in real-world applications such as choosing ads or dynamic pricing of items. We design the first algorithm for solving these problems that improves over a trivial reduction to the non-contextual case. We consider very general settings for both contextual bandits (arbitrary policy sets, Dudik et al. (2011)) and ...
Open Problem: First-Order Regret Bounds for Contextual Bandits
We describe two open problems related to first-order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of Õ(√(L⋆ K ln N)), where there are K actions, N policies, and L⋆ is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret Õ(√L⋆ · poly(K, ln(N/δ))). We describe some positive results, such as an ineff...
On Minimax Optimal Offline Policy Evaluation
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one is minimax optimal up to a constant, whi...