Efficient Approximation of Correlated Sums on Data Streams
Authors
Abstract
In many applications, such as IP network management, data arrives in streams, and queries over those streams must be processed online using limited storage. Correlated-sum (CS) aggregates are a natural class of queries formed by composing basic aggregates on (x, y) pairs. They have the form SUM{g(y) : x ≤ f(AGG(x))}, where AGG(x) can be any basic aggregate and f(), g() are user-specified functions. CS-aggregates cannot be computed exactly in one pass through a data stream using limited storage; hence, we study the problem of computing approximate CS-aggregates. We guarantee a priori error bounds when AGG(x) can be computed in limited space (e.g., MIN, MAX, AVG), using two variants of Greenwald and Khanna's summary structure for the approximate computation of quantiles. Using real data sets, we experimentally demonstrate that an adaptation of the quantile summary structure uses much less space, and is significantly faster, than a more direct use of the quantile summary structure, for the same a posteriori error bounds. Finally, we prove that, when AGG(x) is a quantile (which cannot be computed over a data stream in limited space), the error of a CS-aggregate can be arbitrarily large.
Index terms: Correlated aggregates, data streams, approximation, summary structures, a priori error bounds, IP network management.
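To make the query form SUM{g(y) : x ≤ f(AGG(x))} concrete, here is a minimal sketch that evaluates a CS-aggregate exactly, in two passes over data held in memory. It is not the paper's streaming method: it stores every pair, which is precisely what the quantile-summary approach is meant to avoid. The function name `correlated_sum` and the sample pairs are hypothetical, chosen only for illustration.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

Pair = Tuple[float, float]  # one (x, y) stream element


def correlated_sum(
    pairs: Sequence[Pair],
    agg: Callable[[Sequence[float]], float],
    f: Callable[[float], float],
    g: Callable[[float], float],
) -> float:
    """Exact SUM{g(y) : x <= f(AGG(x))}, computed offline in two passes.

    Illustrative only: this keeps all pairs in memory, so it does not
    satisfy the limited-storage, one-pass setting of the abstract.
    """
    # Pass 1: compute the basic aggregate over x and derive the threshold.
    threshold = f(agg([x for x, _ in pairs]))
    # Pass 2: sum g(y) over the pairs whose x falls at or below the threshold.
    return sum(g(y) for x, y in pairs if x <= threshold)


# Hypothetical sample data, not taken from the paper.
pairs = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (6.0, 40.0)]

# Number of pairs whose x is at most the average x (g(y) = 1 turns SUM into COUNT).
count_below_avg = correlated_sum(pairs, mean, lambda a: a, lambda y: 1)

# Sum of y over pairs whose x is at most twice the minimum x.
sum_near_min = correlated_sum(pairs, min, lambda m: 2 * m, lambda y: y)
```

The second pass is what a one-pass algorithm cannot afford: the threshold f(AGG(x)) is known only after the whole stream has been seen, so a streaming solution has to summarize the (x, y) pairs compactly enough to answer the sum for whatever threshold emerges.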
Similar resources
Time-Decayed Correlated Aggregates over Data Streams
Data stream analysis frequently relies on identifying correlations and posing conditional queries on the data after it has been seen. Correlated aggregates form an important example of such queries, which ask for an aggregation over one dimension of stream elements which satisfy a predicate on another dimension. Since recent events are typically more important than older ones, time decay should...
Influence of Stream channel morphology and in-stream habitats on fish community in Golestan province Streams
Four streams of different sizes in Golestan province were selected to study the effects of environmental factors on fish assemblages using indirect (Detrended Correspondence Analysis, DCA) and direct (Redundancy Analysis, RDA) gradient analysis. DCA of presence-absence and relative abundance data showed a clear gradient and a linear model of species variability. In the within-site RDA, environ...
Efficient simulation of tail probabilities of sums of correlated lognormals
We consider the problem of efficient estimation of tail probabilities of sums of correlated lognormals via simulation. This problem is motivated by the tail analysis of portfolios of assets driven by correlated Black-Scholes models. We propose two estimators that can be rigorously shown to be efficient as the tail probability of interest decreases to zero. The first estimator, based on importan...
On the Variance of Subset Sum Estimation
For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries to arbitrary subset sums. With unit weights, we can compute subset sizes which together with the previous sums provide the subset averages. The question addre...
Some results of 2π-periodic functions by Fourier sums in the space Lp(2π)
In this paper, using the Steklov function, we introduce the generalized continuity modulus and define the class of functions $W^{r,k}_{p,\varphi}$ in the space $L_p$. For this class, we prove an analog of the estimates in [1] in the space $L_p$.
Journal title:
- IEEE Trans. Knowl. Data Eng.
Volume 15, Issue
Pages -
Publication date 2003