Sample Selection Bias Correction Theory
Authors
Abstract
This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.
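The reweighting scheme the abstract describes can be sketched concretely: each biased training point is assigned an importance weight equal to the ratio of the unbiased and biased densities, and the training loss is scaled by those weights. The sketch below is a hypothetical illustration only, with a synthetic 1-D covariate shift and weights computed exactly from the known densities; it is not the cluster-based or kernel-mean-matching estimators analyzed in the paper, which must estimate these weights from finite samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the unbiased ("test") distribution is N(0, 1),
# but the training sample was drawn from a biased N(-1, 1).
n = 2000
x_biased = rng.normal(-1.0, 1.0, n)
# Labels come from a noisy threshold rule with its boundary at x = 0.
y = (x_biased + rng.normal(0.0, 0.3, n) > 0).astype(float)

def gauss(x, mu, s):
    """Density of N(mu, s^2)."""
    return np.exp(-((x - mu) ** 2) / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

# Importance weights w(x) = p_unbiased(x) / p_biased(x); known exactly here
# by construction, whereas in practice they must be estimated.
w = gauss(x_biased, 0.0, 1.0) / gauss(x_biased, -1.0, 1.0)

# Importance-weighted logistic regression by gradient descent: each
# point's gradient contribution is scaled by its weight w_i.
X = np.column_stack([np.ones(n), x_biased])
theta = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (w * (p - y)) / w.sum()
    theta -= 1.0 * grad

# The fitted decision boundary -theta[0]/theta[1] should land near x = 0,
# the boundary under the unbiased distribution, despite the shifted sample.
print(theta)
```

The paper's analysis concerns exactly the step elided here: when the weights `w` are only estimated (e.g., by cluster-based estimation or kernel mean matching), how does the estimation error propagate to the learned hypothesis under a distributionally stable algorithm.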
Similar Papers
Correcting sample selection bias in maximum entropy density estimation
We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using sample...
Bias Correction in Small Sample from Big Data
This paper discusses the bias problem when estimating the population size of big data such as online social networks (OSN) using uniform random sampling and simple random walk. Unlike the traditional estimation problem where the sample size is not very small relative to the data size, in big data a small sample relative to the data size is already very large and costly to obtain. We point out t...
Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
In search advertising, the search engine needs to select the most profitable advertisements to display, which can be formulated as an instance of online learning with partial feedback, also known as the stochastic multi-armed bandit (MAB) problem. In this paper, we show that the naive application of MAB algorithms to search advertising for advertisement selection will produce sample selection b...
The effects of sample selection bias on racial differences in child abuse reporting.
OBJECTIVE The aim was to examine whether design features of Wave 1, 1980 National Incidence Study (NIS) data resulted in sample selection bias when certain victims of maltreatment were excluded. METHOD Logistic regression models for the probability of child abuse reports to the child protective services (CPS) were estimated using maximum likelihood methods for Black (n = 511) and White (n = 2...
The Economic Value of Reject Inference in Credit Scoring
We use data with complete information on both rejected and accepted bank loan applicants to estimate the value of sample bias correction using Heckman’s two-stage model with partial observability. In the credit scoring domain such correction is called reject inference. We validate the model performances with and without the correction of sample bias by various measurements. Results show that it...