Quantifying model errors using similarity to training data
Abstract
When making a prediction with a statistical model, it is not sufficient to know that the model is “good” in the sense that it makes accurate predictions on test data. Another relevant question is: how good is the model for the specific sample whose properties we wish to predict? Stated another way: is the sample within or outside the model’s domain of applicability, or, more generally, to what degree is a test compound within that domain? Numerous studies have sought appropriate measures to address this question [1-4]. Here we focus on a derivative question: can we determine an applicability domain measure suitable for deriving quantitative error bars, that is, error bars that accurately reflect the expected error when making predictions for specified values of the domain measure? Such a measure could then be used to indicate the confidence in a given prediction (i.e., the likely error in a prediction given the degree to which the test compound lies within the model’s domain of applicability).

Ideally, such a measure would be simple to calculate and to understand, would apply to models of all types (including classification and regression models for both molecular and non-molecular data), and would be free of adjustable parameters. Consistent with recent work by others [5,6], the measures we have seen that best meet these criteria are distances to individual samples in the training data. We describe our attempts to construct a recipe for deriving quantitative error bars from these distances.
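To make the idea of distance-based error bars concrete, the following is a minimal sketch and not the authors' recipe, which the abstract does not specify. It assumes Euclidean distance to the nearest training sample as the domain measure and assumes that held-out errors are grouped into distance bins, with the root-mean-square error of each bin serving as the error bar. The function names, the bin edges, and the toy data are all hypothetical; note also that the bin edges are themselves an adjustable parameter, which the abstract hopes to avoid.

```python
import math

def nearest_training_distance(x, training_points):
    """Euclidean distance from x to its closest training sample (assumed metric)."""
    return min(math.dist(x, t) for t in training_points)

def error_bars_by_distance(distances, abs_errors, bin_edges):
    """Group held-out errors by distance bin and return (low, high, rmse) per bin.

    rmse is None for a bin that contains no held-out samples.
    """
    bars = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [e for d, e in zip(distances, abs_errors) if low <= d < high]
        rmse = math.sqrt(sum(e * e for e in in_bin) / len(in_bin)) if in_bin else None
        bars.append((low, high, rmse))
    return bars

def lookup_error_bar(distance, bars):
    """Return the error bar of the bin containing distance, or None if out of range."""
    for low, high, rmse in bars:
        if low <= distance < high:
            return rmse
    return None

# Hypothetical toy data: 2-D descriptors and absolute errors on held-out samples.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
held_out = [((0.1, 0.0), 0.1), ((0.9, 0.1), 0.2), ((3.0, 0.0), 1.0), ((0.0, 4.0), 2.0)]

distances = [nearest_training_distance(x, training) for x, _ in held_out]
abs_errors = [err for _, err in held_out]
bars = error_bars_by_distance(distances, abs_errors, bin_edges=[0.0, 1.0, 5.0])

# Error bar for a new sample, looked up from its distance to the training set.
new_sample = (2.5, 0.0)
new_bar = lookup_error_bar(nearest_training_distance(new_sample, training), bars)
```

In this toy setup the far-from-training bin receives a larger error bar than the near bin, which is the qualitative behavior such a measure is meant to capture. A sample farther away than the last bin edge gets no error bar at all, signalling that it lies outside the range covered by the held-out data.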