Reliable estimation of externally validated prediction errors for QSAR models
Abstract
In most cases of QSAR modelling, the final model used to make predictions is not known a priori but has to be selected in a data-driven fashion (e.g. selection of principal components, variable selection, selection of the best mathematical modelling technique). Reliable estimation of externally validated prediction errors under this model uncertainty is still a challenge in chemoinformatics. To fulfil the standards of external validation, the test data set has to be independent not only of model building but also of model selection. There is still controversy in the literature about how the independent test data set should be chosen and how large it should be. There are basically two options for setting aside test data: 1) a single test data set is set aside, or 2) the test data are generated by repeatedly partitioning the available data into training and test sets, i.e. cross-validation. Since cross-validation uses the data more efficiently, it is to be preferred, in particular for small data sets. This outer cross-validation step must not be confused with a cross-validation step that might be necessary to select the model. If model selection is also done by cross-validation, two loops of cross-validation are necessary [1]. In the inner loop, cross-validation is employed for model selection [2] (also referred to as internal validation), while in the outer loop different test data sets are generated repeatedly and used to assess the models already selected (external validation). In this contribution, double cross-validation is evaluated for its ability to estimate prediction errors under model uncertainty. Depending on how double cross-validation is parameterized (test set size, number of repetitions), it yields either biased or highly variable estimates of the prediction error. The sources of bias and variability are highlighted, and recommendations are provided on how to determine the test set size in order to obtain a favourable bias-variability trade-off.
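The two nested loops described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a toy one-dimensional data set, a k-nearest-neighbour regressor as a stand-in for a QSAR model, and the number of neighbours as the only quantity to be selected. All function names (`knn_predict`, `select_k`, `double_cv`) and the fold counts are hypothetical choices made for illustration.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(p[1] for p in nearest)

def mse(train, test, k):
    """Mean squared prediction error on `test` for a model built on `train`."""
    return mean((y - knn_predict(train, x, k)) ** 2 for x, y in test)

def k_folds(data, n_folds):
    """Yield (train, test) partitions; each object lands in exactly one test fold."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, test

def select_k(data, candidates, n_folds):
    """Inner loop (internal validation): choose k by cross-validated error."""
    def cv_error(k):
        return mean(mse(tr, te, k) for tr, te in k_folds(data, n_folds))
    return min(candidates, key=cv_error)

def double_cv(data, candidates, outer_folds=5, inner_folds=4):
    """Outer loop (external validation): test folds take no part in selection."""
    errors, chosen = [], []
    for outer_train, outer_test in k_folds(data, outer_folds):
        k = select_k(outer_train, candidates, inner_folds)  # selection sees training data only
        chosen.append(k)
        errors.append(mse(outer_train, outer_test, k))      # assess the selected model
    return mean(errors), chosen

# Toy data (an assumption for illustration): noisy linear response.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
data = [(x, 0.5 * x + rng.gauss(0, 1)) for x in xs]

pe, chosen = double_cv(data, candidates=[1, 3, 5, 9])
print(f"estimated prediction error (MSE): {pe:.3f}")
print(f"k selected in each outer fold: {chosen}")
```

The selected k may differ between outer folds, which is the model uncertainty the abstract refers to. The sketch runs a single outer partitioning only. Repeating it with reshuffled data and varying `outer_folds` would be one way to explore the dependence on test set size and number of repetitions that the abstract mentions.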
Similar resources
Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation
BACKGROUND Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging - especially under model uncertainty - and requires independent test objects. These test objects must...
A Novel QSAR Model for the Evaluation and Prediction of (E)-N’-Benzylideneisonicotinohydrazide Derivatives as the Potent Anti-mycobacterium Tuberculosis Antibodies Using Genetic Function Approach
Abstract A dataset of (E)-N’-benzylideneisonicotinohydrazide derivatives as a potent anti-mycobacterium tuberculosis has been investigated utilizing Quantitative Structure-Activity Relationship (QSAR) techniques. Genetic Function Algorithm (GFA) and Multiple Linear Regression Analysis (MLRA) were used to select the descriptors and to generate the correlation QSAR models that relate the Mi...
Predicting Binding Affinity of CSAR Ligands Using Both Structure-Based and Ligand-Based Approaches
We report on the prediction accuracy of ligand-based (2D QSAR) and structure-based (MedusaDock) methods used both independently and in consensus for ranking the congeneric series of ligands binding to three protein targets (UK, ERK2, and CHK1) from the CSAR 2011 benchmark exercise. An ensemble of predictive QSAR models was developed using known binders of these three targets extracted from the ...
QSAR Prediction of Half-Life, Nondimensional Effective Degradation Rate Constant and Effective Péclet Number of Volatile Organic Compounds
In this work, some quantitative structure-activity relationship models were developed for the prediction of three bioenvironmental parameters of 28 volatile organic compounds, which are used in assessing the behavior of pollutants in soil. These parameters are half-life, nondimensional effective degradation rate constant and effective Péclet number in two types of soil. The most effective descripto...
Assessment of Machine Learning Reliability Methods for Quantifying the Applicability Domain of QSAR Regression Models
The vastness of chemical space and the relatively small coverage by experimental data recording molecular properties require us to identify subspaces, or domains, for which we can confidently apply QSAR models. The prediction of QSAR models in these domains is reliable, and potential subsequent investigations of such compounds would find that the predictions closely match the experimental value...