An efficient algorithm for accurate computation of the Dirichlet-multinomial log-likelihood function
Abstract
The Dirichlet-multinomial (DMN) distribution is a fundamental model for multicategory count data with overdispersion. This distribution has many uses in bioinformatics, including applications to metagenomics data, transcriptomics and alternative splicing. The DMN distribution reduces to the multinomial distribution when the overdispersion parameter ψ is 0. Unfortunately, numerical computation of the DMN log-likelihood function by conventional methods results in instability in the neighborhood of ψ = 0. An alternative formulation circumvents this instability, but it leads to long runtimes that make it impractical for the large count data common in bioinformatics. We have developed a new method for computation of the DMN log-likelihood that solves the instability problem without incurring long runtimes. The new approach is composed of a novel formula and an algorithm to extend its applicability. Our numerical experiments show that this new method improves both the accuracy of log-likelihood evaluation and the runtime by several orders of magnitude, especially in the high-count situations that are common in deep sequencing data. Using real metagenomic data, our method achieves manyfold runtime improvement. Our method increases the feasibility of using the DMN distribution to model many high-throughput problems in bioinformatics. We have included in our work an R package giving access to this method and a vignette applying this approach to metagenomic data.
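To make the trade-off described in the abstract concrete, the following Python sketch contrasts the two textbook ways of evaluating the DMN log-likelihood. It is not the method of the paper or its R package. It assumes the parameterization α_k = p_k / ψ with the p_k summing to 1, which the abstract does not state, and the function names are hypothetical. The first function uses log-gamma differences, which is fast but subtracts very large, nearly equal numbers as ψ approaches 0. The second uses a sum of logarithms, which is well defined at ψ = 0 but takes a number of steps proportional to the total count.

```python
import math


def dmn_loglik_lgamma(x, p, psi):
    """Conventional log-gamma form (assumed alpha_k = p_k / psi); needs psi > 0."""
    n = sum(x)
    a = 1.0 / psi  # sum of the Dirichlet parameters, since sum(p) == 1
    # log multinomial coefficient
    ll = math.lgamma(n + 1) - sum(math.lgamma(xk + 1) for xk in x)
    # lgamma(a) and lgamma(a + n) are huge and nearly equal when psi is tiny,
    # so their difference can lose most of its significant digits
    ll += math.lgamma(a) - math.lgamma(a + n)
    ll += sum(
        math.lgamma(pk / psi + xk) - math.lgamma(pk / psi)
        for xk, pk in zip(x, p)
    )
    return ll


def dmn_loglik_sum(x, p, psi):
    """Sum-of-logs form: valid at psi == 0, but costs O(total count) operations."""
    n = sum(x)
    ll = math.lgamma(n + 1) - sum(math.lgamma(xk + 1) for xk in x)
    # product over r = 0 .. x_k - 1 of (p_k + r * psi), taken in log space
    ll += sum(
        math.log(pk + r * psi)
        for xk, pk in zip(x, p)
        for r in range(xk)
    )
    # product over r = 0 .. n - 1 of (1 + r * psi), taken in log space
    ll -= sum(math.log1p(r * psi) for r in range(n))
    return ll


def multinomial_loglik(x, p):
    """Plain multinomial log-likelihood, the psi == 0 limit of the DMN."""
    n = sum(x)
    ll = math.lgamma(n + 1) - sum(math.lgamma(xk + 1) for xk in x)
    ll += sum(xk * math.log(pk) for xk, pk in zip(x, p) if xk > 0)
    return ll


if __name__ == "__main__":
    x = [3, 5, 2]          # illustrative counts, not data from the paper
    p = [0.2, 0.5, 0.3]    # illustrative category probabilities
    for psi in (1e-1, 1e-6, 1e-12):
        print(psi, dmn_loglik_lgamma(x, p, psi), dmn_loglik_sum(x, p, psi))
    print("multinomial limit:", multinomial_loglik(x, p))
```

For moderate ψ the two functions should agree closely. As ψ shrinks, the sum-of-logs value approaches the multinomial log-likelihood while the log-gamma value may drift, and for counts in the millions the loop in the second function becomes the bottleneck, which is the gap the abstract says the new method closes.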
Similar resources
Accurate Inference for the Mean of the Poisson-Exponential Distribution
Although the random sum distribution has been well-studied in probability theory, inference for the mean of such a distribution is very limited in the literature. In this paper, two approaches are proposed to obtain inference for the mean of the Poisson-Exponential distribution. Both proposed approaches require the log-likelihood function of the Poisson-Exponential distribution, but the exact for...
A Stick-Breaking Likelihood for Categorical Data Analysis with Latent Gaussian Models
The development of accurate models and efficient algorithms for the analysis of multivariate categorical data are important and longstanding problems in machine learning and computational statistics. In this paper, we focus on modeling categorical data using Latent Gaussian Models (LGMs). We propose a novel logistic stick-breaking likelihood function for categorical LGMs that can exploit recent...
Monitoring Multinomial Logit Profiles via Log-Linear Models (Quality Engineering Conference Paper)
In certain statistical process control applications, the quality of a process or product can be characterized by a function commonly referred to as a profile. Potential applications of profile monitoring include cases where the quality characteristic of interest is modelled using binary, multinomial or ordinal variables. In this paper, profiles with multinomial response are studied. For this purpo...
Multinomial Dirichlet Gaussian Process Model for Classification of Multidimensional Data
We present a probabilistic multinomial Dirichlet classification model for multidimensional data with Gaussian process priors. We consider an efficient computational method that can be used to obtain the approximate posteriors for the latent variables and parameters needed to define the multiclass Gaussian process classification model. We first investigated the process of inducing a posterior...
The Smoothed Dirichlet Distribution: Understanding Cross-entropy Ranking in Information Retrieval
The Smoothed Dirichlet Distribution: Understanding Cross-Entropy Ranking in Information Retrieval. September 2006. Ramesh M. Nallapati; B.Tech., Indian Institute of Technology, Bombay; M.S., University of Massachusetts Amherst; M.S., University of Massachusetts Amherst; Ph.D., University of Massachusetts Amherst. Directed by: Prof. James Allan. Unigram Language modeling is a successful probabilistic fr...
Journal: Bioinformatics
Volume 30, Issue 11
Pages: -
Publication date: 2014