Bagging GLM: Improved generalized linear model for the analysis of zero-inflated data
Abstract
Species-occurrence data sets tend to contain a large proportion of zero values, i.e., absence values (zero-inflated). Statistical inference using such data sets is likely to be inefficient or lead to incorrect conclusions unless the data are treated carefully. In this study, we propose a new modeling method, combining a regression model with a machine-learning technique, to overcome the problems caused by zero-inflated data sets. We combined a generalized linear model (GLM), which is widely used in ecology, with bootstrap aggregation (bagging), a machine-learning technique. We established distribution models of Vincetoxicum pycnostelma (a vascular plant) and Ninox scutulata (an owl), both of which are endangered and have zero-inflated distribution patterns, using our new method and the traditional GLM, and compared model performance. We also modeled four theoretical data sets that contained different ratios of presence/absence values using the new and traditional methods, and again compared model performance. For the distribution models, our new method performed well compared to traditional GLMs. After bagging, area under the curve (AUC) values were almost the same as with traditional methods, but sensitivity values were higher. Additionally, our new method showed higher sensitivity than the traditional GLM when modeling a theoretical data set containing a large proportion of zero values. These results indicate that our new method has high predictive ability for presence data when analyzing zero-inflated data sets. Generally, predicting presence data is more difficult than predicting absence data. Our new modeling method has potential for advancing species distribution modeling.
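The abstract does not spell out the resampling scheme or the classification threshold, so the following is only a minimal sketch of the general idea of bagging a binomial GLM, not the authors' implementation. It assumes a single covariate, a logit link, plain bootstrap resampling with replacement, averaging of predicted probabilities, and a threshold equal to the observed prevalence. The simulated data and all function names (`fit_logistic_glm`, `bagged_glm`, `predict`, `sensitivity`, `auc`) are hypothetical, and the metrics are computed in-sample purely for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_glm(xs, ys, n_iter=25, ridge=1e-6, max_step=5.0):
    """Binomial GLM (logit link) with one covariate, fitted by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        h00 = h11 = ridge  # tiny ridge keeps the 2x2 Hessian invertible
        h01 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Clip the step so near-separated bootstrap samples cannot blow up.
        b0 += max(-max_step, min(max_step, d0))
        b1 += max(-max_step, min(max_step, d1))
    return b0, b1


def bagged_glm(xs, ys, n_boot=50, rng=None):
    """Fit one GLM per bootstrap sample (assumed scheme: plain resampling)."""
    rng = rng or random.Random(0)
    n = len(xs)
    models = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_logistic_glm([xs[i] for i in idx],
                                       [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Average the predicted presence probability over all fitted models."""
    return sum(sigmoid(b0 + b1 * x) for b0, b1 in models) / len(models)


def sensitivity(ys, probs, threshold):
    """Fraction of true presences predicted as present."""
    tp = sum(1 for y, p in zip(ys, probs) if y == 1 and p >= threshold)
    return tp / sum(ys)


def auc(ys, probs):
    """AUC as the probability that a presence outranks an absence."""
    pos = [p for y, p in zip(ys, probs) if y == 1]
    neg = [p for y, p in zip(ys, probs) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical zero-inflated presence/absence data (mostly absences).
rng = random.Random(42)
xs = [rng.uniform(-2.0, 2.0) for _ in range(400)]
ys = [1 if rng.random() < sigmoid(-3.0 + 1.5 * x) else 0 for x in xs]

single = [fit_logistic_glm(xs, ys)]              # traditional GLM
bagged = bagged_glm(xs, ys, n_boot=50, rng=rng)  # bagging GLM
threshold = sum(ys) / len(ys)                    # assumed cut-off: prevalence

p_single = [predict(single, x) for x in xs]
p_bagged = [predict(bagged, x) for x in xs]
for name, probs in (("GLM", p_single), ("bagging GLM", p_bagged)):
    print(name, "AUC:", round(auc(ys, probs), 3),
          "sensitivity:", round(sensitivity(ys, probs, threshold), 3))
```

With plain resampling, the averaged model usually stays close to the single GLM. Whether sensitivity improves depends on details the abstract leaves open, such as how bootstrap samples are drawn and how the threshold is chosen.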
Similar resources
Assessment of length of stay in a general surgical unit using a zero-inflated generalized Poisson regression
Background: The effective use of limited health care resources is of prime importance. Assessing the length of stay (LOS) is especially important in organizing hospital services and health system. This study was conducted to identify predictors of LOS among patients who were admitted to a general surgical unit. Methods: In this cross-sectional study, the sample included all patien...
Hurdle, Inflated Poisson and Inflated Negative Binomial Regression Models for Analysis of Count Data with Extra Zeros
In this paper, we propose hurdle regression models for analysing count responses with extra zeros. Model parameters are estimated by maximum likelihood. The application of the proposed model is presented on an insurance dataset. In this example, many of the claim counts are equal to zero, which illustrates the application of the model with a zero-inflat...
The negative binomial-Lindley generalized linear model: characteristics and application using crash data.
There has been a considerable amount of work devoted by transportation safety analysts to the development and application of new and innovative models for analyzing crash data. One important characteristic about crash data that has been documented in the literature is related to datasets that contained a large amount of zeros and a long or heavy tail (which creates highly dispersed data). For s...
Generalized Estimating Equations for Zero-Inflated Spatial Count Data
We consolidate the zero-inflated Poisson model for count data with excess zeros (Lambert, 1992) and the two-component model approach for serial correlation among repeated observations (Dobbie and Welsh, 2001) for spatial count data. This concurrently addresses the problem of overdispersion and distinguishes zeros that arise due to random sampling from those that arise due to inherent characteri...
Zero-inflated negative binomial modeling, efficiency for analysis of length of maternity hospitalization
Background: Childbirth is one of the most common reasons for hospitalization throughout the world, and modeling it can explain its distribution and the factors that increase or decrease it. The objective of the present study was to find a suitable model for mothers' length of hospitalization and to compare it with different models. Materials & Methods: The present study is an observational and cross-s...
Journal title:
- Ecological Informatics
Volume 6, Issue
Pages -
Publication date: 2011