A methodology to design heuristics for model selection based on the characteristics of data: Application to investigate when the Negative Binomial Lindley (NB-L) is preferred over the Negative Binomial (NB).
نویسندگان
چکیده
Safety analysts usually use post-modeling methods, such as the Goodness-of-Fit statistics or the Likelihood Ratio Test, to decide between two or more competitive distributions or models. Such metrics require all competitive distributions to be fitted to the data before any comparisons can be accomplished. Given the continuous growth in introducing new statistical distributions, choosing the best one using such post-modeling methods is not a trivial task, in addition to all theoretical or numerical issues the analyst may face during the analysis. Furthermore, and most importantly, these measures or tests do not provide any intuitions into why a specific distribution (or model) is preferred over another (Goodness-of-Logic). This paper ponders into these issues by proposing a methodology to design heuristics for Model Selection based on the characteristics of data, in terms of descriptive summary statistics, before fitting the models. The proposed methodology employs two analytic tools: (1) Monte-Carlo Simulations and (2) Machine Learning Classifiers, to design easy heuristics to predict the label of the 'most-likely-true' distribution for analyzing data. The proposed methodology was applied to investigate when the recently introduced Negative Binomial Lindley (NB-L) distribution is preferred over the Negative Binomial (NB) distribution. Heuristics were designed to select the 'most-likely-true' distribution between these two distributions, given a set of prescribed summary statistics of data. The proposed heuristics were successfully compared against classical tests for several real or observed datasets. Not only they are easy to use and do not need any post-modeling inputs, but also, using these heuristics, the analyst can attain useful information about why the NB-L is preferred over the NB - or vice versa- when modeling data.
منابع مشابه
The negative binomial-Lindley generalized linear model: characteristics and application using crash data.
There has been a considerable amount of work devoted by transportation safety analysts to the development and application of new and innovative models for analyzing crash data. One important characteristic about crash data that has been documented in the literature is related to datasets that contained a large amount of zeros and a long or heavy tail (which creates highly dispersed data). For s...
متن کاملA semiparametric negative binomial generalized linear model for modeling over-dispersed count data with a heavy tail: Characteristics and applications to crash data.
Crash data can often be characterized by over-dispersion, heavy (long) tail and many observations with the value zero. Over the last few years, a small number of researchers have started developing and applying novel and innovative multi-parameter models to analyze such data. These multi-parameter models have been proposed for overcoming the limitations of the traditional negative binomial (NB)...
متن کاملAn Alternative to the Beta-Binomial Distribution with Application in Developmental Toxicology
The beta-binomial distribution is resulted when the probability of success per trial in the binomial distribution varies in successive trials and the mixing distribution is from the beta family. For experiments with binary outcomes, often it may happen that observations exhibit some extra binomial variation and occur in clusters. In such experiments the beta-binomial distribution can generally ...
متن کاملZero inflated Poisson and negative binomial regression models: application in education
Background: The number of failed courses and semesters in students are indicatorsof their performance. These amounts have zero inflated (ZI) distributions. Using ZI Poisson and negative binomial distributions we can model these count data to find the associated factors and estimate the parameters. This study aims at to investigate the important factors related to the educational performance of ...
متن کاملFitting of Count Time Series Models on the Number of Patients Referred to Addiction Treatment Centers in Semnan County
Abstract. Count data over time are observed in many application areas. Many researchers use time series patterns to analyze this data. In this paper, the poisson count time series linear models and negative binomials on this type of data with the explanatory variables are studied. The Likelihood analysis and the evaluation of count time series model based on generalized linear models are pres...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Accident; analysis and prevention
دوره 107 شماره
صفحات -
تاریخ انتشار 2017