Learning from eXtreme Bandit Feedback
Authors
Abstract
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure---named Policy Optimization for eXtreme Models (POXM)---for learning from bandit feedback on XMC tasks. In POXM, the selected actions are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically improves over all baselines.
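The contrast between vanilla importance sampling and the selective estimator described above can be illustrated with a small sketch. This is not the paper's implementation: the policies, rewards, and the zeroing of samples outside the top-p set are simplifying assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 1000, 50, 5  # logged samples, action-space size, top-p subset size

# Hypothetical logging policy mu and target policy pi (rows: contexts).
mu = rng.dirichlet(np.ones(k), size=n)
pi = rng.dirichlet(np.ones(k), size=n)
a = np.array([rng.choice(k, p=mu[i]) for i in range(n)])  # logged actions
r = rng.binomial(1, 0.3, size=n).astype(float)            # logged rewards

# Vanilla importance sampling: weights pi(a|x)/mu(a|x) over the full
# action space, which becomes high-variance as k grows.
w = pi[np.arange(n), a] / mu[np.arange(n), a]
is_est = np.mean(w * r)

# Selective IS (simplified): only samples whose logged action falls in the
# top-p set of the logging policy contribute an importance-weighted term;
# the rest are dropped, restricting the estimator to a small action subset.
top_p = np.argsort(-mu, axis=1)[:, :p]
in_top = np.array([a[i] in top_p[i] for i in range(n)])
sis_est = np.mean(np.where(in_top, w * r, 0.0))
```

Restricting attention to the logging policy's top-p actions discards the rare, heavily up-weighted samples that dominate the variance of the vanilla estimator, at the cost of some bias; the abstract's point is that this trade is favorable when the action space is extreme.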
Similar Resources
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...
Learning Structured Predictors from Bandit Feedback for Interactive NLP
Structured prediction from bandit feedback describes a learning scenario where instead of having access to a gold standard structure, a learner only receives partial feedback in form of the loss value of a predicted structure. We present new learning objectives and algorithms for this interactive scenario, focusing on convergence speed and ease of elicitability of feedback. We present supervise...
Batch learning from logged bandit feedback through counterfactual risk minimization
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...
The Blinded Bandit: Learning with Adaptive Feedback
We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player’s actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, wh...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i10.17058