Assessing Model Fit with Synthetic vs. Real Data
Authors
Abstract
Assessing whether a model is a good fit to the data is nontrivial. The standard practice is to use a few machine learning techniques to learn models from the data, and to pick the one with the highest predictive performance. The winner is considered the best-fitting model. But each model may involve a different machine learning algorithm, each with its own set of parameters and its own constraints on the corresponding model. This results in a large space in which to explore model performance. The actual best-fitting model may be overlooked because of an unfortunate choice of an algorithm's parameters. We address this issue by complementing the comparison of model performance with a method that combines real data with synthetic data generated from the competing models. We naturally expect a model to perform best over its own synthetic data, but the analysis across the other synthetic data sets gives some indication of the model's generality and robustness under different assumptions about the data. Results of our investigation in the domain of educational data mining show that a model's performance is, as expected, best when tested over synthetic data generated from that same model. But we observe much greater performance contrasts across synthetic data than across real data. The pattern of the models' performances over a given synthetic data set forms a kind of “signature”. We discuss the significance of this signature for assessing model fit, and whether it can provide cues to the data's underlying ground truth.
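The cross-evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy models (an item-driven and a student-driven predictor of binary responses), the generator functions, and all names below are hypothetical stand-ins chosen only to show how a table of per-data-set performances, the “signature”, might be assembled.

```python
import random

def generate_item_driven(n_students, n_items, rng):
    """Toy generator (assumption): success depends only on a per-item probability."""
    p_item = [rng.uniform(0.1, 0.9) for _ in range(n_items)]
    return [[int(rng.random() < p_item[j]) for j in range(n_items)]
            for _ in range(n_students)]

def generate_student_driven(n_students, n_items, rng):
    """Toy generator (assumption): success depends only on a per-student probability."""
    p_student = [rng.uniform(0.1, 0.9) for _ in range(n_students)]
    return [[int(rng.random() < p_student[i]) for _ in range(n_items)]
            for i in range(n_students)]

def split_cells(n_students, n_items, rng, test_fraction=0.2):
    """Randomly split (student, item) cells into training and test cells."""
    cells = [(i, j) for i in range(n_students) for j in range(n_items)]
    rng.shuffle(cells)
    cut = int(len(cells) * test_fraction)
    return cells[cut:], cells[:cut]

def fit_means(data, train_cells, axis):
    """Mean success rate per student (axis=0) or per item (axis=1)."""
    sums, counts = {}, {}
    for cell in train_cells:
        key = cell[axis]
        sums[key] = sums.get(key, 0) + data[cell[0]][cell[1]]
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

def accuracy(data, train_cells, test_cells, axis):
    """Fit the toy model on the training cells and score it on the test cells."""
    means = fit_means(data, train_cells, axis)
    hits = 0
    for i, j in test_cells:
        key = (i, j)[axis]
        prediction = int(means.get(key, 0.5) >= 0.5)
        hits += int(prediction == data[i][j])
    return hits / len(test_cells)

def signature(datasets, models, seed=0):
    """Return {data set name: {model name: accuracy}} for every pairing."""
    rng = random.Random(seed)
    table = {}
    for data_name, data in datasets.items():
        train, test = split_cells(len(data), len(data[0]), rng)
        table[data_name] = {name: accuracy(data, train, test, axis)
                            for name, axis in models.items()}
    return table

rng = random.Random(42)
datasets = {
    "synthetic_item": generate_item_driven(200, 40, rng),
    "synthetic_student": generate_student_driven(200, 40, rng),
}
# Each toy model is identified by the axis it averages over.
models = {"item_model": 1, "student_model": 0}
sig = signature(datasets, models)
for data_name, row in sig.items():
    print(data_name, {name: round(acc, 3) for name, acc in row.items()})
```

In this sketch a real response matrix would simply be added as another entry in `datasets`, so that its row of accuracies can be compared with the rows obtained on the synthetic data sets.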
Similar resources
Goodness of Fit of Skills Assessment Approaches: Insights from Patterns of Real vs. Synthetic Data Sets
This study investigates the issue of the goodness of fit of different skills assessment models using both synthetic and real data. Synthetic data is generated from the different skills assessment models. The results show wide differences of performances between the skills assessment models over synthetic data sets. The set of relative performances for the different models create a kind of “sign...
On the Canonical-Based Goodness-of-fit Tests for Multivariate Skew-Normality
It is well-known that the skew-normal distribution can provide an alternative model to the normal distribution for analyzing asymmetric data. The aim of this paper is to propose two goodness-of-fit tests for assessing whether a sample comes from a multivariate skew-normal (MSN) distribution. We address the problem of multivariate skew-normality goodness-of-fit based on the empirical Laplace tra...
Using of frailty model baseline proportional hazard rate in Real Data Analysis
Many populations encountered in survival analysis are often not homogeneous. Individuals are flexible in their susceptibility to causes of death, response to treatment and influence of various risk factors. Ignoring this heterogeneity can result in misleading conclusions. To deal with these problems, the proportional hazard frailty model was introduced. In this paper, the frailty model is ex...
Assessing positive matrix factorization model fit: a new method to estimate uncertainty and bias in factor contributions at the measurement time scale
A Positive Matrix Factorization receptor model for aerosol pollution source apportionment was fit to a synthetic dataset simulating one year of daily measurements of ambient PM2.5 concentrations, comprised of 39 chemical species from nine pollutant sources. A novel method was developed to estimate model fit uncertainty and bias at the daily time scale, as related to factor contributions. A circ...
Assessing positive matrix factorization model fit: a new method to estimate uncertainty and bias in factor contributions at the daily time scale
A Positive Matrix Factorization receptor model for aerosol pollution source apportionment was fit to a synthetic dataset simulating one year of daily measurements of ambient PM2.5 concentrations, comprised of 39 chemical species from nine pollutant sources. A novel method was developed to estimate model fit uncertainty and bias at the daily time scale, as related to factor contributions. A ba...