Randomized Algorithms for Data Reconciliation in Wide Area Aggregate Query Processing

نویسندگان

  • Fei Xu
  • Chris Jermaine
چکیده

Many aspects of the data integration problem have been considered in the literature: how to match schemas across different data sources, how to decide when different records refer to the same entity, how to efficiently perform the required entity resolution in a batch fashion, and so on. However, what has largely been ignored is a way to efficiently deploy these existing methods in a realistic, distributed enterprise integration environment. The straightforward use of existing methods often requires that all data be shipped to a coordinator for cleaning, which is often unacceptable. We develop a set of randomized algorithms that allow efficient application of existing entity resolution methods to the answering of aggregate queries over data that have been distributed across multiple sites. Using our methods, it is possible to efficiently generate aggregate query results that account for duplicate and inconsistent values scattered across a federated system.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ارائه روشی پویا جهت پاسخ به پرس‌وجوهای پیوسته تجمّعی اقتضایی

Data Streams are infinite, fast, time-stamp data elements which are received explosively. Generally, these elements need to be processed in an online, real-time way. So, algorithms to process data streams and answer queries on these streams are mostly one-pass. The execution of such algorithms has some challenges such as memory limitation, scheduling, and accuracy of answers. They will be more ...

متن کامل

Relational Databases Query Optimization using Hybrid Evolutionary Algorithm

Optimizing the database queries is one of hard research problems. Exhaustive search techniques like dynamic programming is suitable for queries with a few relations, but by increasing the number of relations in query, much use of memory and processing is needed, and the use of these methods is not suitable, so we have to use random and evolutionary methods. The use of evolutionary methods, beca...

متن کامل

Optimizing Aggregate Query Processing in Cloud Data Warehouses

In this paper, we study and optimize the aggregate query processing in a highly distributed Cloud Data Warehouse, where each database stores a subset of relational data in a star-schema. Existing aggregate query processing algorithms focus on optimizing various query operations but give less importance to communication cost overhead (Two-phase algorithm). However, in cloud architectures, the co...

متن کامل

Aggregate-Query Processing in Data Warehousing Environments

In this paper we introduce generalized projections (GP s), an extension of duplicateeliminating projections, that capture aggregations, groupbys, duplicate-eliminating projections (distinct), and duplicate-preserving projections in a common uni ed framework. Using GP s we extend well known and simple algorithms for SQL queries that use distinct projections to derive algorithms for queries using...

متن کامل

Analysis of Pre-processing and Post-processing Methods and Using Data Mining to Diagnose Heart Diseases

Today, a great deal of data is generated in the medical field. Acquiring useful knowledge from this raw data requires data processing and detection of meaningful patterns and this objective can be achieved through data mining. Using data mining to diagnose and prognose heart diseases has become one of the areas of interest for researchers in recent years. In this study, the literature on the ap...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007