A Software Infrastructure for the CLEENEX Optimizer
Abstract
Data quality problems are a growing concern. This document focuses on one specific problem: the existence of approximate duplicate records. Data cleaning aims to correct data quality problems, which arise in a variety of situations, and several data cleaning tools address them. One task of a data cleaning program is approximate duplicate detection. This detection must be efficient: on a large data set, comparing every record with every other record becomes a performance bottleneck. The goal of the optimizer in a data cleaning tool is to build several execution plans for the data cleaning program and, based on the cost of each plan, choose the most efficient one. Such an optimizer needs a software infrastructure to support it. In particular, this infrastructure must provide several alternatives that improve the efficiency of approximate duplicate detection. In this thesis, we designed and implemented an infrastructure to support an optimizer for CLEENEX, a data cleaning tool. We also describe the methodology used to validate the implemented infrastructure.
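To make the cost argument in the abstract concrete, here is a minimal Python sketch, not taken from CLEENEX itself. It assumes two hypothetical execution plans for duplicate detection: a nested loop over all record pairs, and a blocking plan that only compares records sharing a key. It also assumes that the cost of a plan is simply its number of record comparisons. The function names, the sample records, and the first-letter blocking key are all illustrative choices.

```python
from collections import defaultdict
from itertools import combinations


def naive_pairs(records):
    """Candidate pairs for a plan that compares every record with every other."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Candidate pairs for a plan that only compares records sharing a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def choose_plan(plans):
    """Pick the plan with the lowest assumed cost (number of comparisons)."""
    return min(plans, key=lambda name: len(plans[name]))


# Illustrative data; the blocking key (first letter) is an assumption.
records = ["John Smith", "Jon Smith", "Mary Jones", "Marie Jones", "Alan Turing"]
plans = {
    "nested_loop": naive_pairs(records),
    "blocking_on_first_letter": blocked_pairs(records, key=lambda r: r[0].lower()),
}
best = choose_plan(plans)
print(best, {name: len(pairs) for name, pairs in plans.items()})
```

In this sketch the nested loop grows quadratically with the number of records, while blocking compares only within groups. The trade-off is that blocking can miss duplicates whose keys differ, so a real cost model would likely need to weigh accuracy as well as comparison count.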
Similar resources
Experience in Testing Compiler Optimizers Using Comparison Checking
This paper describes our experience of testing and debugging an optimizer using comparison checking. Although this study is based on Jaramillo et al.’s work, the experience will help those who test optimizers using this technique. In our implementation, important values during the execution of programs are output as a file trace before and after each optimization. Then a comparison phase checks...
A Status Report on XXL - a Software Infrastructure for Efficient Query Processing
XXL is a Java library that contains a rich infrastructure for implementing advanced query processing functionality. The library offers low-level components like access to raw disks as well as high-level ones like a query optimizer. On the intermediate levels, XXL provides a demand-driven cursor algebra, a framework for indexing and a powerful package for supporting aggregation. The library is p...
Healthcare Districting Optimization Using Gray Wolf Optimizer and Ant Lion Optimizer Algorithms (case study: South Khorasan Healthcare System in Iran)
In this paper, the problem of population districting in the health system of South Khorasan province has been investigated in the form of an optimization problem. Now that the districting problem is considered as a strategic matter, it is vital to obtain efficient solutions in order to implement in the system. Therefore in this study two meta-heuristic algorithms, Ant Lion Optimizer (ALO) and G...
Strata: A Software Dynamic Translation Infrastructure
Software dynamic translation is the alteration of a running program to achieve a specific objective. For example, a dynamic optimizer uses software dynamic translation to modify a running program with the objective of making the program run faster. In addition to its demonstrated utility in dynamic optimizers, software dynamic translation also shows promise for producing applications that are a...
MAO - An extensible micro-architectural optimizer
Performance matters, and so does repeatability and predictability. Today’s processors’ micro-architectures have become so complex as to now contain many undocumented, not understood, and even puzzling performance cliffs. Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and...