Unbiased estimates for linear regression via volume sampling
Authors
Abstract
Given a full rank matrix X with more columns than rows, consider the task of estimating the pseudo inverse $X^{+}$ based on the pseudo inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (i.e., volume sampling). The resulting estimator is unbiased and, surprisingly, the covariance of the estimator also has a closed form: it equals a specific factor times $X^{+\top}X^{+}$. The pseudo inverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of X. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from X. Using our methods we show that the weight vector of the solution for the subproblem is an unbiased estimator of the optimal solution for the whole problem based on all column labels. We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than the state of the art and to obtain bounds for the total loss of the estimated least-squares solution on all labeled columns.
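The unbiasedness claim in the abstract can be checked by brute force on a tiny instance. The sketch below is only an illustration, not the paper's algorithm: the matrix `X`, the labels `y`, the subset size `s`, and the helper names `pinv_columns`, `column`, and `expected_pinv` are all hypothetical choices made for this example. It enumerates every column subset of size `s`, weights each subset by det(X_S X_S^T) (the squared volume), and averages the zero-padded subset pseudo inverses using exact rational arithmetic. The two-row case is hard-coded so that the 2 x 2 inverse can be written out by hand.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy data: X has d = 2 rows and n = 5 columns; y has one label per column.
X = [[Fraction(v) for v in row] for row in ([1, 2, 0, 1, 3], [0, 1, 1, 2, 1])]
y = [Fraction(v) for v in (1, 3, 2, 5, 4)]
n, s = 5, 3  # s is the sample size (at least the number of rows)


def pinv_columns(cols):
    """Pseudo inverse A^T (A A^T)^{-1} of a 2 x k matrix given by its columns.

    Returns (rows of the k x 2 pseudo inverse, det(A A^T)); rows is None
    when the determinant is zero.
    """
    a = sum(c[0] * c[0] for c in cols)
    b = sum(c[0] * c[1] for c in cols)
    d = sum(c[1] * c[1] for c in cols)
    det = a * d - b * b
    if det == 0:
        return None, det
    rows = [((c[0] * d - c[1] * b) / det, (c[1] * a - c[0] * b) / det) for c in cols]
    return rows, det


def column(j):
    """Return column j of X as a pair."""
    return (X[0][j], X[1][j])


def expected_pinv():
    """Average the zero-padded subset pseudo inverses under volume sampling."""
    total = [[Fraction(0), Fraction(0)] for _ in range(n)]
    norm = Fraction(0)
    for S in combinations(range(n), s):
        rows, det = pinv_columns([column(j) for j in S])
        norm += det  # unnormalized probability of subset S
        if rows is None:
            continue  # zero-volume subsets are never sampled
        for j, r in zip(S, rows):
            total[j][0] += det * r[0]
            total[j][1] += det * r[1]
    return [[v / norm for v in row] for row in total]


full_pinv, _ = pinv_columns([column(j) for j in range(n)])
mean_pinv = expected_pinv()
unbiased = all(tuple(m) == f for m, f in zip(mean_pinv, full_pinv))

# The least-squares weights are linear in the pseudo inverse,
# so the expected subproblem solution matches the full solution.
w_full = [sum(full_pinv[j][k] * y[j] for j in range(n)) for k in range(2)]
w_mean = [sum(mean_pinv[j][k] * y[j] for j in range(n)) for k in range(2)]
print(unbiased, w_full == w_mean)
```

Because rows of the padded pseudo inverse outside the sampled subset are zero, each subset's weight vector depends only on the labels of the sampled columns, which matches the setting where the remaining labels are never requested. Exhaustive enumeration is exponential in n and is used here only to make the expectation exact.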
Similar resources
An incidence density sampling program for nested case-control analyses.
BACKGROUND The nested case-control design can be a very efficient approach to an epidemiological investigation. In order to obtain unbiased estimates of relative risk, controls should be selected by incidence density sampling, which involves matching each case to a sample of those who are at risk at the time of case occurrence. METHODS This paper presents a simple computer program for inciden...
Using Inverse Probability Bootstrap Sampling to Eliminate Sample Induced Bias in Model Based Analysis of Unequal Probability Samples
In ecology, as in other research fields, efficient sampling for population estimation often drives sample designs toward unequal probability sampling, such as in stratified sampling. Design based statistical analysis tools are appropriate for seamless integration of sample design into the statistical analysis. However, it is also common and necessary, after a sampling design has been implemente...
A New Unbiased and Efficient Class of LSH-Based Samplers and Estimators for Partition Function Computation in Log-Linear Models
Log-linear models are arguably the most successful class of graphical models for large-scale applications because of their simplicity and tractability. Learning and inference with these models require calculating the partition function, which is a major bottleneck and intractable for large state spaces. Importance Sampling (IS) and MCMC-based approaches are lucrative. However, the condition of ...
Estimating Hunting Success Rates via Bayesian Generalized Linear Models
Post-season harvest surveys provide data used in the management of Missouri wildlife. These surveys provide information on the number of animals harvested, hunting pressure and hunter success rate. These estimates provide unbiased results at the statewide level due to the large sample size. However, if this survey information is used to make county estimates, poor results often occur due to sma...
Liu Estimates and Influence Analysis in Regression Models with Stochastic Linear Restrictions and AR (1) Errors
In the linear regression models with AR (1) error structure when collinearity exists, stochastic linear restrictions or modifications of biased estimators (including Liu estimators) can be used to reduce the estimated variance of the regression coefficients estimates. In this paper, the combination of the biased Liu estimator and stochastic linear restrictions estimator is considered to overcom...