Scalable Multidimensional Hierarchical Bayesian Modeling on Spark
نویسندگان
چکیده
We consider the problem of estimating occurrence rates of rare events for extremely sparse data using pre-existing hierarchies and selected features to perform inference along multiple dimensions. In particular, we focus on the problem of estimating click rates for {Advertiser, Publisher, User} tuples where both the Advertisers and the Publishers are organized as hierarchies that capture broad contextual information at different levels of granularities. Typically, the click rates are low and the coverage of the hierarchies and dimensions is sparse. To overcome these difficulties, we decompose the joint prior of the three-dimensional Click-Through-Rate (CTR) using tensor decomposition and propose a Multidimensional Hierarchical Bayesian framework (abbreviated as MadHab). We set up a specific framework of each dimension to model dimension-specific characteristics. More specifically, we consider the hierarchical beta process prior for the Advertiser dimension and for the Publisher dimension respectively and a feature-dependent mixture model for the User dimension. Besides the centralized implementation, we propose a distributed algorithm through Spark for inference which make the model highly scalable and suited for large scale data mining applications. We demonstrate that on a real world ads campaign platform our framework can effectively discriminate extremely rare events in terms of their click propensity.
منابع مشابه
Sequential Monte Carlo Bandits
In this paper we propose a flexible and efficient framework for handling multi-armed bandits, combining sequential Monte Carlo algorithms with hierarchical Bayesian modeling techniques. The framework naturally encompasses restless bandits, contextual bandits, and other bandit variants under a single inferential model. Despite the model’s generality, we propose efficient Monte Carlo algorithms t...
متن کاملScalable Nonparametric Bayesian Multilevel Clustering
Multilevel clustering problems where the content and contextual information are jointly clustered are ubiquitous in modern datasets. Existing works on this problem are limited to small datasets due to the use of the Gibbs sampler. We address the problem of scaling up multilevel clustering under a Bayesian nonparametric setting, extending the MC2 model proposed in (Nguyen et al., 2014). We groun...
متن کاملDinTucker: Scaling up Gaussian process models on multidimensional arrays with billions of elements
Infinite Tucker Decomposition (InfTucker) and random function prior models, as nonparametric Bayesian models on infinite exchangeable arrays, are more powerful models than widely-used multilinear factorization methods including Tucker and PARAFAC decomposition, (partly) due to their capability of modeling nonlinear relationships between array elements. Despite their great predictive performance...
متن کاملBayesian time series models and scalable inference
With large and growing datasets and complex models, there is an increasing need for scalable Bayesian inference. We describe two lines of work to address this need. In the first part, we develop new algorithms for inference in hierarchical Bayesian time series models based on the hidden Markov model (HMM), hidden semi-Markov model (HSMM), and their Bayesian nonparametric extensions. The HMM is ...
متن کاملAMIDST: a Java Toolbox for Scalable Probabilistic Machine Learning
The AMIDST Toolbox is a software for scalable probabilistic machine learning with a special focus on (massive) streaming data. The toolbox supports a flexible modeling language based on probabilistic graphical models with latent variables and temporal dependencies. The specified models can be learnt from large data sets using parallel or distributed implementations of Bayesian learning algorith...
متن کامل