Extensive Large-Scale Study of Error in Sampling-Based Distinct Value Estimators for Databases

Authors

  • Vinay Deolalikar
  • Hernan Laffitte
Abstract

The problem of distinct value estimation has many applications. As a critical component of query optimizers in databases, it also has high commercial impact. Many distinct value estimators have been proposed, using various statistical approaches. However, characterizing the errors incurred by these estimators is an open problem: existing analytical approaches are not powerful enough, and extensive empirical studies at large scale do not exist. We conduct an extensive large-scale empirical study of 11 distinct value estimators from four different approaches to the problem, over families of Zipfian distributions whose parameters model real-world applications. Our study is the first to scale to the billion-row sizes at which today’s large commercial databases operate. This allows us to characterize the error that is encountered in real-world applications of distinct value estimation. By mining the generated data, we show that estimator error depends on a key latent parameter, the average uniform class size, that has not been studied previously. This parameter also allows us to unearth error patterns that were previously unknown. Importantly, ours is the first approach that provides a framework for ...

∗ This is the full-length version of a shorter published paper, and includes supplementary material for the published paper. Please cite as “Vinay Deolalikar and Hernan Laffitte: Extensive Large-Scale Study of Error in Sampling-Based Distinct Value Estimators for Databases, IEEE Big Data Conference, Washington DC, December 2016.”

† Contact author; work done at HP Labs, Palo Alto.
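As a rough illustration of the kind of experiment the abstract describes, the sketch below draws a Zipfian column, takes a uniform row sample, and measures the error of one sampling-based estimator. Everything in it is an assumption made for illustration: the abstract does not name the 11 estimators studied, so the well-known GEE estimator (sqrt(n/r) * f1 plus the number of values seen more than once in the sample) is used as a stand-in, ratio error is one common error measure rather than necessarily the authors' own, and the function names and the table size, domain size, and skew are hypothetical and far smaller than the billion-row scale of the study.

```python
import math
import random
from collections import Counter


def zipf_column(n_rows, n_values, skew, rng):
    """Draw n_rows values from a Zipfian distribution over 1..n_values.

    The probability of value k is proportional to 1 / k**skew.
    """
    weights = [1.0 / (k ** skew) for k in range(1, n_values + 1)]
    return rng.choices(range(1, n_values + 1), weights=weights, k=n_rows)


def gee_estimate(sample, n_rows):
    """GEE estimate of the number of distinct values in a table of n_rows rows.

    Illustrative stand-in estimator: sqrt(n/r) * f1 + (values seen >= 2 times),
    where r is the sample size and f1 counts values seen exactly once.
    """
    r = len(sample)
    value_counts = Counter(sample)
    f1 = sum(1 for c in value_counts.values() if c == 1)
    seen_more_than_once = len(value_counts) - f1
    return math.sqrt(n_rows / r) * f1 + seen_more_than_once


def ratio_error(true_d, est_d):
    """Ratio error: max(D / D_hat, D_hat / D); 1.0 means a perfect estimate."""
    return max(true_d / est_d, est_d / true_d)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    column = zipf_column(n_rows=100_000, n_values=5_000, skew=1.0, rng=rng)
    sample = rng.sample(column, 1_000)  # uniform 1% sample without replacement
    true_d = len(set(column))
    est_d = gee_estimate(sample, len(column))
    print(f"true D = {true_d}, estimate = {est_d:.0f}, "
          f"ratio error = {ratio_error(true_d, est_d):.2f}")
```

Repeating such a run over a grid of skew values, domain sizes, and sampling fractions would give a small-scale analogue of an error study; it says nothing about the average uniform class size parameter, which the abstract does not define.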


Similar resources

Classic and Bayes Shrinkage Estimation in Rayleigh Distribution Using a Point Guess Based on Censored Data

Introduction: In classical methods of statistics, the parameter of interest is estimated based on a random sample using natural estimators such as maximum likelihood or unbiased estimators (sample information). In practice, the researcher has prior information about the parameter in the form of a point guess value. Information in the guess value is called nonsample information. Thomp...


Comparison of Small Area Estimation Methods for Estimating Unemployment Rate

Extended Abstract. In recent years, the need for small area estimation has greatly increased for large surveys, particularly household surveys at the Statistical Centre of Iran (SCI), because of costs and respondent burden. The lack of suitable auxiliary variables between two decennial housing and population censuses is a challenge for SCI in using these methods. In general, the...


Sampling-Based Estimation of the Number of Distinct Values of an Attribute

We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators i...


Post hoc power estimation in large-scale multiple testing problems

BACKGROUND: The statistical power or multiple Type II error rate in large-scale multiple testing problems as, for example, in gene expression microarray experiments, depends on typically unknown parameters and is therefore difficult to assess a priori. However, it has been suggested to estimate the multiple Type II error rate post hoc, based on the observed data. METHODS: We consider a class of...


Pitman-Closeness of Preliminary Test and Some Classical Estimators Based on Records from Two-Parameter Exponential Distribution

In this paper, we study the performance of estimators of the parameters of the two-parameter exponential distribution based on upper records. The generalized likelihood ratio (GLR) test was used to generate a preliminary test estimator (PTE) for both parameters. We have compared the proposed estimator with maximum likelihood (ML) and unbiased estimators (UE) under mean-squared error (MSE) and Pitman me...


Journal title:
  • CoRR

Volume abs/1612.00476

Publication date: 2016