A Framework for Estimating Stream Expression Cardinalities

Authors

  • Anirban Dasgupta
  • Kevin J. Lang
  • Lee Rhodes
  • Justin Thaler
Abstract

Given m distributed data streams A1, ..., Am, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over A1, ..., Am. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
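
The problem setup can be illustrated with a minimal Python sketch of fixed-threshold hash sampling. This is a generic illustration, not necessarily one of the algorithms described in the paper. The names `unit_hash`, `sketch`, and `estimate`, and the sampling threshold `theta`, are assumptions introduced here. Each stream keeps only the identifiers whose hash falls below `theta`; a set expression is evaluated on the samples, and the resulting count is scaled by `1 / theta`.

```python
import hashlib


def unit_hash(item: str) -> float:
    """Map an identifier to a pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Keep the top 53 bits so the float division is exact and stays below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def sketch(stream, theta: float) -> set:
    """Return the set of hash values below theta (hypothetical helper).

    Duplicated identifiers hash to the same value, so they are stored once.
    """
    return {h for h in map(unit_hash, stream) if h < theta}


def estimate(sample: set, theta: float) -> float:
    """Scale the sample size by 1 / theta to estimate a cardinality."""
    return len(sample) / theta


if __name__ == "__main__":
    theta = 0.25  # assumed sampling rate, chosen only for illustration
    stream_a = (f"id-{i}" for i in range(0, 10_000))
    stream_b = (f"id-{i}" for i in range(5_000, 15_000))
    a, b = sketch(stream_a, theta), sketch(stream_b, theta)
    # Set expressions are applied to the samples, then scaled.
    print("union       ~", estimate(a | b, theta))
    print("intersection~", estimate(a & b, theta))
    print("difference  ~", estimate(a - b, theta))
```

Because every stream uses the same hash function and the same threshold, an identifier is either sampled in all streams that contain it or in none, which is what lets ordinary set operations on the samples stand in for the set expression on the full streams.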

Similar resources

A Simple and Efficient Estimation Method for Stream Expression Cardinalities

Estimating the cardinality (i.e. number of distinct elements) of an arbitrary set expression defined over multiple distributed streams is one of the most fundamental queries of interest. Earlier methods based on probabilistic sketches have focused mostly on the sketching algorithms. However, the estimators do not fully utilize the information in the sketches and thus are not statistically effic...

A Signal Processing Approach to Estimate Underwater Network Cardinalities with Lower Complexity

An inspection of signal processing approach in order to estimate underwater network cardinalities is conducted in this research. A matter of key prominence for underwater network is its cardinality estimation as the number of active cardinalities varies several times due to numerous natural and artificial reasons due to harsh underwater circumstances. So, a proper estimation technique is mandat...

HyperMinHash: Jaccard index sketching in LogLog space

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard log n-space MinHash by building off of a HyperLogLog count-distinct sketch. For a multiplicative approximation error 1+ε on a Jaccard index t, given a ...

A Bayesian Approach to Estimating the Selectivity of Conjunctive Predicates

Cost-based optimizers in relational databases make use of data statistics to estimate intermediate result cardinalities. Those cardinalities are needed to estimate access plan costs in order to choose the cheapest plan for executing a query. Since statistics are usually collected on single attributes only, the optimizer can not directly estimate result cardinalities of conjunctive predicates ov...

MTS Sketch for Accurate Estimation of Set-Expression Cardinalities from Small Samples

Sketch-based streaming algorithms allow efficient processing of big data. These algorithms use small fixed-size storage to store a summary (“sketch”) of the input data, and use probabilistic algorithms to estimate the desired quantity. However, in many real-world applications it is impractical to collect and process the entire data stream; the common practice is thus to sample and process only ...

Publication date: 2016