New Cardinality Estimation Methods for HyperLogLog Sketches
Abstract
This work presents new cardinality estimation methods for data sets recorded by HyperLogLog sketches. A simple derivation of the original estimator was found that also gives insight into how to correct its deficiencies. The result is an improved estimator that is unbiased over the full cardinality range, is easily computable, and does not rely on empirically determined data as previous approaches do. Based on the maximum likelihood principle, a second unbiased estimation method is presented, which can also be extended to estimate the cardinalities of the union, intersection, or relative complements of two sets that are both represented as HyperLogLog sketches. Experimental results show that this approach is more precise than the conventional technique using the inclusion-exclusion principle.
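For orientation, the conventional baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the improved or maximum likelihood estimators of the paper: it uses the original raw HyperLogLog estimator with the small-range (linear counting) correction, and estimates an intersection by inclusion-exclusion. The class name `HyperLogLog`, the helper `intersection_estimate`, the precision `p=12`, and the use of a truncated SHA-256 digest as a 64-bit hash are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch with m = 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Assumption: 64-bit hash taken from the first 8 bytes of SHA-256.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                      # first p bits select the register
        rest = h & ((1 << width) - 1)         # remaining bits
        # 1-based position of the leftmost 1-bit in `rest`;
        # equals width + 1 when all remaining bits are zero.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The sketch of a union is the register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)      # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A and B| ~ |A| + |B| - |A or B|."""
    return a.estimate() + b.estimate() - a.merge(b).estimate()


sketch_a, sketch_b = HyperLogLog(), HyperLogLog()
for i in range(0, 20000):
    sketch_a.add(i)
for i in range(10000, 30000):
    sketch_b.add(i)

union_est = sketch_a.merge(sketch_b).estimate()
inter_est = intersection_estimate(sketch_a, sketch_b)
print(f"union ~ {union_est:.0f}, intersection ~ {inter_est:.0f}")
```

Because the inclusion-exclusion result is a difference of three noisy estimates, its error can be large relative to a small intersection, which is the weakness the abstract's maximum likelihood approach is said to address.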
Related Resources
New cardinality estimation algorithms for HyperLogLog sketches
This paper presents new methods to estimate the cardinalities of multisets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based on the maximum likelihood principle a second unbiased method is derived together with a robust and efficient numerical algorithm to calculate the e...
Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm
We describe a new cardinality estimation algorithm that is extremely space-efficient. It applies one of three novel estimators to the compressed state of the Flajolet-Martin-85 coupon collection process. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, the new algorithm simultaneously wins on all three dimensions of the time/space/accuracy tradeoff. Our proto...
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the...
Efficient cardinality estimation for k-mers in large DNA sequencing data sets
We present an open implementation of the HyperLogLog cardinality estimation sketch for counting fixed-length substrings of DNA strings (“k-mers”). The HyperLogLog sketch implementation is in C++ with a Python interface, and is distributed as part of the khmer software package. khmer is freely available from https://github.com/dib-lab/khmer under a BSD License. The features presented here are in...
LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting
The information presented in this paper defines LogLog-Beta (LogLog-β). LogLog-β is a new algorithm for estimating cardinalities based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities, therefore, it is more efficient and simpler to implement. Our simulations show that the accuracy provided by the new al...
Journal title:
- CoRR
Volume abs/1706.07290
Publication date: 2017