Learning from eXtreme Bandit Feedback

Authors

Abstract

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure---named Policy Optimization for eXtreme Models (POXM)---for learning from bandit feedback on XMC tasks. In POXM, the selected actions are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically improves over all baselines.
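The core idea of the sIS estimator described above is to restrict importance weighting to a small per-instance action subset (the top-p actions of the logging policy), trading a small bias for a large variance reduction. The following minimal sketch illustrates that restriction on logged data; it is an illustrative simplification, not the paper's exact Rao-Blackwellized estimator, and the function name and data layout are assumptions.

```python
def sis_estimate(rewards, logged_actions, pi0_probs, pi_probs, top_p_sets):
    """Illustrative selective importance sampling (sIS) sketch.

    rewards[i]        : observed reward for logged sample i
    logged_actions[i] : action chosen by the logging policy
    pi0_probs[i]      : logging-policy probability of that action
    pi_probs[i]       : target-policy probability of that action
    top_p_sets[i]     : top-p actions of the logging policy for instance i

    Samples whose logged action falls outside the top-p set are
    ignored, which curbs the variance of the importance weights at
    the cost of a small, controlled bias.
    """
    total, kept = 0.0, 0
    for r, a, p0, p1, subset in zip(
        rewards, logged_actions, pi0_probs, pi_probs, top_p_sets
    ):
        if a in subset:
            total += r * p1 / p0  # importance-weighted reward
            kept += 1
    return total / max(kept, 1)
```

With p close to the full action-space size this sketch degenerates to ordinary importance sampling; the paper's point is that a data-adjusted p far smaller than the action space keeps the weights well behaved.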


Similar Articles

Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...


Learning Structured Predictors from Bandit Feedback for Interactive NLP

Structured prediction from bandit feedback describes a learning scenario where instead of having access to a gold standard structure, a learner only receives partial feedback in form of the loss value of a predicted structure. We present new learning objectives and algorithms for this interactive scenario, focusing on convergence speed and ease of elicitability of feedback. We present supervise...


Batch learning from logged bandit feedback through counterfactual risk minimization

We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...


The Blinded Bandit: Learning with Adaptive Feedback

We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player’s actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, wh...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i10.17058