Scalable Splitting of Massive Data Streams
Abstract
Scalable execution of continuous queries over massive data streams often requires splitting input streams into parallel sub-streams over which query operators are executed in parallel. Automatic stream splitting is in general very difficult, as the optimal parallelization may depend on application semantics. To enable application-specific stream splitting, we introduce splitstream functions, in which the user specifies non-procedural stream partitioning and replication. For high-volume streams, the stream splitting itself becomes a performance bottleneck. A cost model is introduced that estimates the performance of splitstream functions with respect to throughput and CPU usage. We implement parallel splitstream functions and relate experimental results to cost model estimates. Based on the results, a splitstream function called autosplit is proposed, which scales well for high degrees of parallelism and is robust for varying proportions of stream partitioning and replication. We show how user-defined parallelization using autosplit provides substantially improved scalability (L = 64) over previously published results for the Linear Road Benchmark.
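As an illustration of the two routing behaviours the abstract names, partitioning and replication, the following is a minimal single-process sketch. It is not the paper's splitstream functions or autosplit: the names `split_stream` and `make_router`, the dictionary tuple layout, and the `kind`/`key` fields are all hypothetical and chosen only for this example.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]  # hypothetical tuple layout: a plain dict per stream element


def make_router(
    n: int,
    key: Callable[[Record], int],
    is_broadcast: Callable[[Record], bool],
) -> Callable[[Record], List[int]]:
    """Build a routing function that returns the sub-stream indexes for a record."""

    def route(record: Record) -> List[int]:
        if is_broadcast(record):
            # Replication: every sub-stream receives a copy.
            return list(range(n))
        # Partitioning: exactly one sub-stream, chosen by the integer key.
        return [key(record) % n]

    return route


def split_stream(
    records: Iterable[Record],
    n: int,
    route: Callable[[Record], List[int]],
) -> List[List[Record]]:
    """Distribute records over n sub-streams according to route."""
    substreams: List[List[Record]] = [[] for _ in range(n)]
    for record in records:
        for index in route(record):
            substreams[index].append(record)
    return substreams


# Hypothetical input: "data" records are partitioned, "control" records are replicated.
stream = [
    {"kind": "data", "key": 7},
    {"kind": "data", "key": 12},
    {"kind": "control", "key": 3},
]
router = make_router(
    4,
    key=lambda r: r["key"],
    is_broadcast=lambda r: r["kind"] == "control",
)
subs = split_stream(stream, 4, router)
```

In this sketch the splitter is a single sequential loop, which is the kind of component the abstract says becomes a bottleneck at high stream volumes. The sketch makes no attempt to parallelize the splitting itself.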
Related resources
Massive Scale-out of Expensive Continuous Queries
Scalable execution of expensive continuous queries over massive data streams requires input streams to be split into parallel substreams. The query operators are continuously executed in parallel over these sub-streams. Stream splitting involves both partitioning and replication of incoming tuples, depending on how the continuous query is parallelized. We provide a stream splitting operator tha...
Scalable Parallelization of Expensive Continuous Queries over Massive Data Streams
Zeitler, E. 2011. Scalable Parallelization of Expensive Continuous Queries over Massive Data Streams. Acta Universitatis Upsaliensis. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 836. 35 pp. Uppsala. ISBN 978-91-554-8095-0. Numerous applications in, for example, science, engineering, and financial analysis increasingly require online analysis...
An Efficient, Scalable Content-Based Messaging System
Large-scale information processing environments must rapidly search through massive streams of raw data to locate useful information. These data streams contain textual and numeric data items, and may be highly structured or mostly freeform text. This project aims to create a high performance and scalable engine for locating relevant content in data streams. Based on the J2EE Java Messaging Ser...
SAMOA: scalable advanced massive online analysis
SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several...
A Scalable Heterogeneous Solution for Massive Data Collection and Database Loading
Massive collection of data at high rates is critical for many industries. Typically, a massive stream of records is gathered from the business information network at a very high rate. Because of the complexity of the collection process, the classical database solution falls short. The high volume and rate of records involved requires a heterogeneous pipeline comprised of two major parts: a syst...