High-Performance Processing of Continuous Uncertain Data
Authors
Abstract
May 2013. Thanh T. L. Tran, B.E., University of Melbourne; M.S., University of Massachusetts Amherst; Ph.D., University of Massachusetts Amherst. Directed by: Professor Yanlei Diao.
Uncertain data has arisen in a growing number of applications such as sensor networks, RFID systems, weather radar networks, and digital sky surveys. The fact that the raw data in these applications is often incomplete, imprecise and even misleading has two implications: (i) the raw data is not suitable for direct querying, and (ii) feeding the uncertain data into existing systems produces results of unknown quality. This thesis presents a system for uncertain data processing that has two key functionalities: (i) capturing and transforming raw noisy data into rich queryable tuples that carry the attributes needed for query processing with quantified uncertainty, and (ii) performing query processing on such tuples, which captures changes of uncertainty as data goes through various query operators. The proposed system considers data naturally captured by continuous distributions, which is prevalent in sensing and scientific applications.
The first part of the thesis addresses data capture and transformation by proposing a probabilistic modeling and inference approach. Since this task is application-specific and requires domain knowledge, the approach is demonstrated for RFID data from mobile readers. More specifically, the proposed solution involves an inference and cleaning substrate that transforms raw RFID data streams into object location tuple streams, where locations are inferred from raw noisy data and their uncertain values are captured by probability distributions.
The second and main part of this thesis examines query processing for uncertain data modeled by continuous random variables. The proposed system includes new data models and algorithms for relational processing, with a focus on aggregation and conditioning operations. For operations of high complexity, optimizations including approximations with guaranteed error bounds are considered. Complex queries involving a mix of operations are then addressed by query planning, which, given a query, finds an efficient plan that meets user-defined accuracy requirements.
Besides relational processing, this thesis also supports user-defined functions (UDFs) on uncertain data, where the goal is to compute the output distribution given uncertain input and a black-box UDF. The proposed solution employs a learning-based approach using Gaussian processes to compute approximate output with error bounds, and a suite of optimizations for high performance in online settings such as data stream processing and interactive data analysis.
The techniques proposed in this thesis are thoroughly evaluated using both synthetic data with controlled properties and various real-world datasets from the domains of severe weather monitoring, object tracking using RFID readers, and computational astrophysics. The experimental results show that these techniques can yield high accuracy, meet stream speeds, and outperform existing techniques such as Monte Carlo sampling for many important workloads.
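To make the UDF problem concrete, here is a minimal sketch of the Monte Carlo sampling baseline that the abstract names as an existing technique. It is not the thesis's Gaussian-process method. The function names `monte_carlo_udf`, `empirical_cdf`, and `black_box`, the choice of a Gaussian input, and the sample count are all assumptions made for illustration.

```python
import random
from bisect import bisect_right


def monte_carlo_udf(udf, mean, std, n_samples=10_000, seed=0):
    """Approximate the distribution of udf(X) for X ~ Normal(mean, std).

    Returns the sorted output samples, which define an empirical CDF.
    The Gaussian input model is an assumption for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    outputs = [udf(rng.gauss(mean, std)) for _ in range(n_samples)]
    outputs.sort()
    return outputs


def empirical_cdf(sorted_samples, y):
    """Estimate P(udf(X) <= y) as the fraction of samples that are <= y."""
    return bisect_right(sorted_samples, y) / len(sorted_samples)


def black_box(x):
    # Hypothetical UDF standing in for any opaque function.
    return x * x


samples = monte_carlo_udf(black_box, mean=0.0, std=1.0)
p = empirical_cdf(samples, 1.0)  # estimates P(X^2 <= 1) = P(|X| <= 1)
print(f"Estimated P(udf(X) <= 1.0): {p:.3f}")
```

For a standard normal input the exact probability is about 0.683, so the estimate can be checked against a known value. Each estimate costs one UDF call per sample, which is the expense that motivates approximation approaches like the one the abstract describes.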
Similar resources
CLARO: Modeling and Processing of High-Volume Uncertain Data Streams
Uncertain data streams, where data is incomplete, imprecise, and even misleading, have been observed in a variety of environments. Feeding uncertain data streams to existing stream systems can produce results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the Claro system that supports uncertain data stream processing for data that is nat...
Optimizing Probabilistic Query Processing on Continuous Uncertain Data
Uncertain data management is becoming increasingly important in many applications, in particular, in scientific databases and data stream systems. Uncertain data in these new environments is naturally modeled by continuous random variables. An important class of queries uses complex selection and join predicates and requires query answers to be returned if their existence probabilities pass a t...
Robust H2 switching gain-scheduled controller design for switched uncertain LPV systems
In this article, a new approach is proposed to design robust switching gain-scheduled dynamic output feedback control for switched uncertain continuous-time linear parameter varying (LPV) systems. The proposed robust switching gain-scheduled controllers are robustly designed so that the stability and H2-gain performance of the switched closed-loop uncertain LPV system can be guaranteed even und...
Mixed Qualitative/Quantitative Dynamic Simulation of Processing Systems
In this article, the methodology proposed by Li and Wang for mixed qualitative and quantitative modeling and simulation of the temporal behavior of processing units is reexamined and extended to more complex cases. Their approach centers on the multivariate statistics of principal component analysis (PCA), along with clustered fuzzy digraphs and reasoning. The PCA and fuz...
Threshold Interval Indexing for Complicated Uncertain Data
Uncertain data is an increasingly prevalent topic in database research, given the advance of instruments which inherently generate uncertainty in their data. In particular, the problem of indexing uncertain data for range queries has received considerable attention. To efficiently process range queries, existing approaches mainly focus on reducing the number of disk I/Os. However, due to the in...