Poisson Mixtures
Authors
Abstract
Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon’s theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).

1. Problem: Word Rates Are Highly Variable

Many applications of statistical natural language processing make use of a so-called “bag-of-words” assumption. Of course, it is well known that word rates depend on many factors: genre, author, topic, etc. Table 1, for example, shows that “said” is more frequent in some types of texts and less frequent in others.
Table 1: Frequency of “said” Depends on Source

Source                            Freq per million words
________________________________________________________
Department of Energy Abstracts                        41
Groliers Encyclopedia                                 64
Federalist Papers                                    287
Hansard                                             1072
Harper & Row Books                                  1632
Brown Corpus                                        1645
Wall Street Journal                                 5600
Associated Press 1994                               8514
Associated Press 1987                               9525
Associated Press 1991                               9861
Associated Press 1990                             10,040
Associated Press 1989                             10,195
Associated Press 1988                             10,313

The million-word Brown Corpus (Francis and Kucera, 1982) was constructed in the 1960s to help researchers better understand how word rates vary from document to document and genre to genre. The corpus consists of 500 excerpts of approximately 2000 words each, selected from a wide variety of genres: Press (documents 1-88), Religion (89-105), Hobbies (106-141), Popular Lore (142-189), Belle-Lettres (190-264), Government and House Organs (265-294), Learned (295-374), Fiction (375-491), and Humor (492-500).

Figures 1 and 2 demonstrate that this structure has a dramatic impact on the frequency of “said.” Figure 1 shows the frequency of “said” in each of the 500 documents. Figure 2 is similar except that the Brown Corpus was replaced by a corpus of 500 documents randomly generated by a binomial distribution:
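
As a rough illustration of the contrast the text describes, the following sketch generates 500 documents of 2000 words each in two ways: with a binomial at the Brown Corpus rate of “said” from Table 1, and with a Poisson whose parameter θ is drawn per document from a Γ density (the Negative Binomial special case). Several things here are assumptions and do not come from the source: the shape value `SHAPE = 0.5` is arbitrary and not fitted to any data, every document is given exactly 2000 words, the random seed is fixed for repeatability, and the Poisson sampler is Knuth's simple multiplication method.

```python
import math
import random
from statistics import mean, pvariance

DOCS = 500                   # number of documents, as in the Brown Corpus
DOC_LEN = 2000               # assumed fixed length (the source says "approximately 2000")
P_SAID = 1645 / 1_000_000    # Brown Corpus rate of "said" from Table 1
SHAPE = 0.5                  # hypothetical Gamma shape, chosen only for illustration


def binomial_count(n, p, rng):
    """Count successes in n independent Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def poisson_count(theta, rng):
    """Draw from Poisson(theta) using Knuth's multiplication method."""
    limit = math.exp(-theta)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def gamma_poisson_count(mean_rate, shape, rng):
    """Draw theta from a Gamma with the given mean, then x ~ Poisson(theta)."""
    theta = rng.gammavariate(shape, mean_rate / shape)  # (shape, scale)
    return poisson_count(theta, rng)


rng = random.Random(0)  # fixed seed so the run is repeatable
expected = DOC_LEN * P_SAID

# Homogeneous baseline: every document shares the same word rate.
binom = [binomial_count(DOC_LEN, P_SAID, rng) for _ in range(DOCS)]

# Poisson mixture: the rate itself varies from document to document.
mixed = [gamma_poisson_count(expected, SHAPE, rng) for _ in range(DOCS)]

print(f"expected count per document: {expected:.2f}")
print(f"binomial      mean={mean(binom):.2f}  variance={pvariance(binom):.2f}")
print(f"gamma-Poisson mean={mean(mixed):.2f}  variance={pvariance(mixed):.2f}")
```

Both samples have the same expected count per document, but the mixture's variance is much larger because θ differs across documents. That extra variance over documents is the kind of heterogeneity a single binomial or Poisson cannot represent.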