A Framework for Estimating Stream Expression Cardinalities
Abstract
Given m distributed data streams A1, ..., Am, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over A1, ..., Am. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
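To make the problem statement concrete, here is a minimal sketch of one well-known way to estimate set-expression cardinalities from small samples: a K-minimum-values style sketch that keeps only the hashed identifiers falling below a threshold. It is an illustration under stated assumptions, not the algorithm from the paper. The function names (unit_hash, make_sketch, combine, estimate) and the default k are hypothetical, and for brevity make_sketch hashes the whole input at once rather than processing it incrementally as a real streaming implementation would.

```python
import hashlib
import heapq
import operator


def unit_hash(item: str) -> float:
    """Map an identifier to a pseudo-random point in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def make_sketch(stream, k=256):
    """Return (theta, hashes): the k smallest hash values and their threshold.

    Assumption for brevity: the input is hashed in one pass and deduplicated
    in memory; a streaming version would maintain a bounded heap instead.
    """
    smallest = heapq.nsmallest(k + 1, {unit_hash(x) for x in stream})
    if len(smallest) <= k:
        # Fewer than k + 1 distinct items: keep everything, threshold is 1.
        return 1.0, set(smallest)
    # The (k + 1)-th smallest value becomes the threshold; keep the k below it.
    return smallest[k], set(smallest[:k])


def combine(a, b, op):
    """Apply a set operation (and_, or_, sub) to two sketches."""
    theta = min(a[0], b[0])
    kept_a = {h for h in a[1] if h < theta}
    kept_b = {h for h in b[1] if h < theta}
    return theta, op(kept_a, kept_b)


def estimate(sketch):
    """Scale the retained sample count up by the sampling threshold."""
    theta, hashes = sketch
    return len(hashes) / theta


if __name__ == "__main__":
    a = make_sketch(f"user{i}" for i in range(20000))
    b = make_sketch(f"user{i}" for i in range(10000, 30000))
    print("union        ~", round(estimate(combine(a, b, operator.or_))))
    print("intersection ~", round(estimate(combine(a, b, operator.and_))))
    print("difference   ~", round(estimate(combine(a, b, operator.sub))))
```

Because both sketches are cut down to the smaller of the two thresholds before the set operation is applied, the retained hashes behave like a uniform sample of the identifiers, which is what allows the count to be scaled up by 1/theta. Accuracy for intersections and differences degrades as the result set becomes small relative to the union.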
Related Resources
A Simple and Efficient Estimation Method for Stream Expression Cardinalities
Estimating the cardinality (i.e. number of distinct elements) of an arbitrary set expression defined over multiple distributed streams is one of the most fundamental queries of interest. Earlier methods based on probabilistic sketches have focused mostly on the sketching algorithms. However, the estimators do not fully utilize the information in the sketches and thus are not statistically effic...
A Signal Processing Approach to Estimate Underwater Network Cardinalities with Lower Complexity
This research examines a signal processing approach to estimating underwater network cardinalities. Cardinality estimation is a matter of key importance for an underwater network, because the cardinality changes frequently for numerous natural and artificial reasons in harsh underwater conditions. So, a proper estimation technique is mandat...
HyperMinHash: Jaccard index sketching in LogLog space
In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard log n-space MinHash by building off of a HyperLogLog count-distinct sketch. For a multiplicative approximation error 1 + ε on a Jaccard index t, given a ...
A Bayesian Approach to Estimating the Selectivity of Conjunctive Predicates
Cost-based optimizers in relational databases make use of data statistics to estimate intermediate result cardinalities. Those cardinalities are needed to estimate access plan costs in order to choose the cheapest plan for executing a query. Since statistics are usually collected on single attributes only, the optimizer cannot directly estimate result cardinalities of conjunctive predicates ov...
MTS Sketch for Accurate Estimation of Set-Expression Cardinalities from Small Samples
Sketch-based streaming algorithms allow efficient processing of big data. These algorithms use small fixed-size storage to store a summary (“sketch”) of the input data, and use probabilistic algorithms to estimate the desired quantity. However, in many real-world applications it is impractical to collect and process the entire data stream; the common practice is thus to sample and process only ...