Maximum Entropy Density Estimation and Modeling Geographic Distributions of Species
Abstract
The maximum entropy (maxent) approach, formally equivalent to maximum likelihood, is a widely used density-estimation method. When input datasets are small, maxent is likely to overfit. Overfitting can be eliminated by various smoothing techniques, such as regularization and constraint relaxation, but the theory explaining their properties is often missing or needs to be derived for each case separately. In this dissertation, we propose a unified treatment for a large and general class of smoothing techniques. We provide fully general guarantees on their statistical performance and propose optimization algorithms with complete convergence proofs. As special cases, we can easily derive performance guarantees for many known regularization types, including L1 and L2-squared regularization. Furthermore, our general approach enables us to derive entirely new regularization functions with superior statistical guarantees. The new regularization functions use information about the structure of the feature space, incorporate information about sample selection bias, and combine information across several related density-estimation tasks. We propose algorithms solving a large and general subclass of generalized maxent problems, including all those discussed in the dissertation, and prove their convergence. Our convergence proofs generalize techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. As an application of maxent, we discuss an important problem in ecology and conservation: the problem of modeling geographic distributions of species. Here, small sample sizes hinder accurate modeling of rare and endangered species. Generalized maxent offers several advantages over previous techniques. In particular, generalized maxent addresses the problem in a statistically sound manner and allows principled extensions to situations when data collection is biased or when we have access to data on many related species.
The utility of our unified approach is demonstrated in comprehensive experiments on large real-world datasets. We find that generalized maxent is among the best-performing species-distribution modeling techniques. Our experiments also show that the contributions of this dissertation, i.e., regularization strategies, bias-removal approaches, and multiple-estimation techniques, all significantly improve the predictive performance of maxent.
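To make the L1-regularized case mentioned in the abstract concrete, the following is a minimal sketch, not the dissertation's algorithm. It assumes a finite domain of candidate sites, features scaled to [0, 1], and a single regularization weight `beta` shared by all features, and it fits a Gibbs distribution q(x) ∝ exp(λ·f(x)) by minimizing log Z(λ) − λ·μ̂ + β‖λ‖₁ with plain proximal gradient descent (a swapped-in, generic solver). The function name `fit_maxent_l1`, the synthetic data, and all parameter values are hypothetical.

```python
import numpy as np


def fit_maxent_l1(features, sample_idx, beta=0.05, step=0.5, iters=2000):
    """Fit an L1-regularized maxent (Gibbs) distribution on a finite domain.

    features   : (n_points, n_features) array, assumed scaled to [0, 1]
    sample_idx : indices of the domain points where the species was observed
    beta       : L1 regularization weight (assumed equal for every feature)
    """
    emp_mean = features[sample_idx].mean(axis=0)  # empirical feature means
    lam = np.zeros(features.shape[1])

    def gibbs(lam):
        logits = features @ lam
        logits -= logits.max()  # shift for numerical stability
        q = np.exp(logits)
        return q / q.sum()

    for _ in range(iters):
        q = gibbs(lam)
        # Gradient of log Z(lam) - lam . emp_mean is E_q[f] - emp_mean.
        grad = q @ features - emp_mean
        lam = lam - step * grad
        # Soft-thresholding: proximal step for the L1 penalty.
        lam = np.sign(lam) * np.maximum(np.abs(lam) - step * beta, 0.0)

    return lam, gibbs(lam)


# Synthetic example: 50 sites, 3 environmental features, 10 presence records.
rng = np.random.default_rng(0)
features = rng.random((50, 3))
weights = np.exp(2.0 * features[:, 0])  # presences favor high values of feature 0
sample_idx = rng.choice(50, size=10, p=weights / weights.sum())
beta = 0.05

lam, q = fit_maxent_l1(features, sample_idx, beta=beta)
print("weights:", lam)
print("feature-mean gap:", np.abs(q @ features - features[sample_idx].mean(axis=0)))
```

At the optimum, each feature expectation under the fitted distribution lies within `beta` of its empirical mean, which is the constraint-relaxation view of L1 regularization. The fixed step size is only adequate here because there are few features and all lie in [0, 1].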
Similar resources
Modeling of the Maximum Entropy Problem as an Optimal Control Problem and its Application to Pdf Estimation of Electricity Price
In this paper, continuous optimal control theory is used to model and solve the maximum entropy problem for a continuous random variable. The maximum entropy principle provides a method to obtain the least-biased probability density function (PDF) estimate. In this paper, to find a closed form solution for the maximum entropy problem with any number of moment constraints, the entropy is consi...
A Note on the Bivariate Maximum Entropy Modeling
Let X = (X1, X2) be a continuous random vector. Under the assumption that the marginal distributions of X1 and X2 are given, we develop models for the vector X when there is partial information about the dependence structure between X1 and X2. The models, which are obtained based on the well-known Principle of Maximum Entropy, are called the maximum entropy (ME) mo...
Using ecological niche modeling to determine avian richness hotspots
Understanding distributions of wildlife species is a key step towards identifying biodiversity hotspots and designing effective conservation strategies. In this paper, the spatial pattern of bird diversity in Golestan Province, Iran, was estimated. Ecological niche modeling was used to determine distributions of 144 bird species across the province using a maximum entropy algorithm. Richness...
Methods of Data Analysis Working with probability distributions
One of the key problems in non-parametric data analysis is to create a good model of a generating probability distribution, assuming we are given as data a finite sample from that distribution. Obviously this problem is ill-posed for continuous distributions: with finite data, there is no way to distinguish between (or exclude) distributions that are not restricted to be smooth. The question th...
Quasi-continuous maximum entropy distribution approximation with kernel density
This paper extends maximum entropy estimation of discrete probability distributions to the continuous case. This transition leads to a nonparametric estimation of a probability density function, preserving the maximum entropy principle. Furthermore, the derived density estimate provides a minimum mean integrated square error. In a second step, it is shown how boundary conditions can be included...