Competitive Distribution Estimation: Why is Good-Turing Good
Abstract
Estimating distributions over large alphabets is a fundamental machine-learning tenet. Yet no method is known to estimate all distributions well. For example, add-constant estimators are nearly min-max optimal but often perform poorly in practice, and practical estimators such as absolute discounting, Jelinek-Mercer, and Good-Turing are not known to be near optimal for essentially any distribution. We describe the first universally near-optimal probability estimators. For every discrete distribution, they are provably nearly the best in the following two competitive ways. First, they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the distribution up to a permutation. Second, they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the exact distribution but, like all natural estimators, restricted to assign the same probability to all symbols appearing the same number of times. Specifically, for distributions over k symbols and n samples, we show that for both comparisons, a simple variant of the Good-Turing estimator is always within KL divergence of (3 + o(1))/n^{1/3} from the best estimator, and that a more involved estimator is within Õ(min(k/n, 1/√n)). Conversely, we show that any estimator must have a KL divergence of at least Ω̃(min(k/n, 1/n^{2/3})) over the best estimator for the first comparison, and at least Ω̃(min(k/n, 1/√n)) for the second.
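To make the objects named in the abstract concrete, the following is a minimal Python sketch of an add-constant estimator, a Good-Turing-style "natural" estimator (one that assigns the same probability to all symbols seen the same number of times), and the KL divergence used to compare them. The function names, the constant beta = 0.5, the example counts, and in particular the rule for switching between the empirical frequency and the Good-Turing formula are illustrative assumptions; the abstract does not specify the authors' variant, so this should not be read as their estimator.

```python
import math
from collections import Counter


def add_constant(counts, beta=0.5):
    """Add-constant estimator: (count + beta) / (n + beta * k).

    beta = 0.5 is an illustrative choice, not a value taken from the abstract.
    """
    n = sum(counts)
    k = len(counts)
    return [(c + beta) / (n + beta * k) for c in counts]


def good_turing_hybrid(counts):
    """Illustrative Good-Turing-style estimator (assumed rule, see lead-in).

    phi[t] is the number of symbols that appear exactly t times. A symbol
    seen t times gets the empirical frequency t / n when t > phi[t + 1],
    and a smoothed Good-Turing value otherwise. The result is normalized.
    """
    n = sum(counts)
    phi = Counter(counts)  # missing keys count as 0
    raw = []
    for t in counts:
        if t > phi[t + 1]:
            raw.append(t / n)
        else:
            raw.append((phi[t + 1] + 1) / phi[t] * (t + 1) / n)
    total = sum(raw)
    return [r / total for r in raw]


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example: k = 5 symbols, n = 5 samples, two symbols unseen.
example_counts = [3, 1, 1, 0, 0]
true_p = [0.4, 0.2, 0.2, 0.1, 0.1]  # assumed ground truth for the demo

q_add = add_constant(example_counts)
q_gt = good_turing_hybrid(example_counts)
print("add-constant KL:", kl_divergence(true_p, q_add))
print("Good-Turing-style KL:", kl_divergence(true_p, q_gt))
```

Both estimators give every symbol, including the unseen ones, a positive probability, so the KL divergence from any true distribution on the same alphabet is finite. The competitive guarantees in the abstract bound the gap between such an estimator's KL divergence and that of the best estimator with extra prior knowledge, not the divergence itself.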
Similar Resources
Distribution-Dependent Performance of the Good-Turing Estimator for the Missing Mass
The Good-Turing estimator for the missing mass has certain bias and concentration properties which define its performance. In this paper we give distribution-dependent conditions under which this performance can or cannot be matched by a trivial estimator, that is one which does not depend on observation. We introduce the notion of accrual function for a distribution, and derive our conditions ...
Always Good Turing: Asymptotically Optimal Probability Estimation
While deciphering the Enigma code, Good and Turing derived an unintuitive, yet effective, formula for estimating a probability distribution from a sample of data. We define the attenuation of a probability estimator as the largest possible ratio between the per-symbol probability assigned to an arbitrarily long sequence by any distribution, and the corresponding probability assigned by the esti...
Sequence Probability Estimation for Large Alphabets
We consider the problem of estimating the probability of an observed string drawn i.i.d. from an unknown distribution. The key feature of our study is that the length of the observed string is assumed to be of the same order as the size of the underlying alphabet. In this setting, many letters are unseen and the empirical distribution tends to overestimate the probability of the observed letter...
Why and How Is Compassion Necessary to Provide Good Healthcare? Comments From an Academic Physician; Comment on “Why and How Is Compassion Necessary to Provide Good Quality Healthcare?”
This is a short commentary to the editorial issued by Marianna Fotaki, entitled: “Why and how is compassion necessary to provide good quality healthcare.” It introduces the necessity of a more cognitive approach to explore further the determinants of behavior towards compassionate care. It raises questions about the importance of training towards a more patient-care and values driven healthcare...
Why Good Quality Care Needs Philosophy More Than Compassion; Comment on “Why and How Is Compassion Necessary to Provide Good Quality Healthcare?”
Although Marianna Fotaki’s Editorial is helpful and challenging by looking at both the professional and institutional requirements for reinstalling compassion in order to aim for good quality healthcare, the causes that hinder this development remain unexamined. In this commentary, 3 causes are discussed; the boundary between the moral and the political; Neoliberalism; and the underdevelopment ...