Classification of Contradiction Patterns
Authors: Heiko Müller, Ulf Leser, and Johann-Christoph Freytag

Abstract

Resolving conflicts between overlapping databases requires an understanding of the reasons that lead to the inconsistencies. Provided that conflicts do not occur randomly but follow certain regularities, patterns of the form "If condition Then conflict" provide a valuable means to facilitate their understanding. In previous work, we adapted existing association rule mining algorithms to identify such patterns. In this paper we discuss extensions to our initial approach aimed at identifying possible update operations that caused the conflicts between the databases. This is done by restricting the items used for pattern mining. We further propose a classification of patterns based on mappings between the contradicting values to represent special cases of conflict-generating updates.

1 Conflicts in Overlapping Databases

Many databases exist with overlaps in their sets of represented real-world entities. There are different reasons for these overlaps, such as:
• replication of data sources at different sites to improve the performance of web services and the availability of the data,
• independent production of data representing a common set of entities or individuals by different groups or institutions, and
• data integration, where data is copied from sources, possibly transformed and manipulated for data cleansing, and stored in an integrated data warehouse.

Whenever overlapping data is administered at different sites, there is a high probability that differences occur. Many of these inconsistencies are systematic, caused by the usage of different controlled vocabularies, different measurement units, different data modifications for data cleansing, or by consistent bias in experimental data analysis. When producing a consistent view of the data, knowledge about such systematic deviations can be used to assess the individual quality of database copies for conflict resolution.

Assuming that conflicts do not occur randomly but follow specific (but unknown) regularities, patterns of the form "If condition Then conflict" provide a valuable means to facilitate the identification and understanding of systematic deviations. In Müller et al. (2004) we proposed the adaptation of existing data mining algorithms to find such contradiction patterns. Evaluated by a domain expert, these patterns can be utilized to assess the correctness of conflicting values and therefore for conflict resolution.

In this paper we present a modified approach for mining contradictory data aimed at enhancing the expressiveness of the identified patterns. This approach is based on the assumption that conflicts result from modification of databases that were initially equal (see Figure 1). Conflicting values are introduced by applying different sequences of update operations, representing for example different data cleansing activities, to a common ancestor database. Given a pair of contradicting databases, each resulting from a different update sequence, we reproduce conflict generation to assist a domain expert in conflict resolution. We present an algorithm for identifying update operations that describe, in retrospect, the emergence of conflicts between the databases. These conflict generators are a special class of contradiction patterns. We further classify the patterns based on the mapping between contradicting values that they define. Each such class corresponds to a special type of conflict-generating update operation. The classification further enhances the ability to prune irrelevant patterns in the algorithm.

The remainder of this paper is structured as follows: In Section 2 we define conflict generators for database pairs. Section 3 presents an algorithm for finding such conflict generators. We discuss related work in Section 4 and conclude in Section 5.
2 Reproducing Conflict Generation

Databases r1 and r2 from Figure 1 contain fictitious results of different research groups investigating a common set of individual owls. Identification of tuples representing the same individual is accomplished by the unique object identifier ID. The problem of assigning these object identifiers is not considered within this paper, i.e., we assume a preceding duplicate detection step (see for example Hernandez and Stolfo (1995)). Note that we are only interested in finding update operations that introduce conflicts between the overlapping parts of databases. Therefore, we also assume that all databases have equal sets of object identifiers.

Conflicting values are highlighted in Figure 1 and conflicts are systematic. The conflicts in attribute Species are caused by the different usage of English and Latin vocabularies to denote species names, conflicts in attribute Color are due to a finer-grained color description for male and female snowy owls (Nyctea Scandica) in database r2, and the conflicts within attribute Size are caused by rounding or truncation errors for different species in database r1.

Fig. 1. A model for conflict emergence in overlapping databases

Reproducing conflict generation requires the identification of possible predecessors of the given databases. We consider exactly one predecessor for each of the databases r1 and r2 and each non-key attribute A. A predecessor is used to describe conflict generation within attribute A by identifying update operations that modify the predecessor resulting in conflicts between r1 and r2. Figure 1 shows the predecessor rp for database r1 and attribute Color. Also shown is an update operation that describes the generation of conflicts by replacing the finer-grained original color specifications in rp with the more generic term 'white' in r1. Currently, we only consider update operations that modify the values within one attribute.
We start by defining the considered predecessors and then define a representation for conflict generators.

2.1 Preceding Databases

The databases within this paper are relational databases consisting of a single relation r, following the same schema R(A1, ..., An). The domain of each attribute A ∈ R is denoted by dom(A). Database tuples are denoted by t and attribute values of a tuple are denoted by t[A]. There exists a primary key PK ∈ R for object identification. The primary key attribute is excluded from any modification.

Given a pair of databases r1 and r2, the set of potential predecessors is infinite. We restrict this set by allowing the values in the common ancestor to be modified at most once by any conflict generator. This restriction enables the definition of exactly one predecessor for each of the databases and each non-key attribute. In the remainder of this paper we consider only conflicts within a fixed attribute B ∈ R \ {PK}.

Let rp be the predecessor for database r1. Database rp equals r1 in all attributes that are different from B. These values will not be affected by an update operation modifying attribute B. The values for rp in attribute B are equal to the corresponding values in the contradicting database r2. These are the values that are changed to generate conflicts between r1 and r2:

\[
r_p = \Bigl\{\, t \;\Bigm|\; t \in \mathrm{dom}(A_1) \times \cdots \times \mathrm{dom}(A_n) \;\wedge\; \exists\, t_1 \in r_1,\ t_2 \in r_2 : \forall A \in R : t[A] =
\begin{cases}
t_1[A], & \text{if } A \neq B \\
t_2[A], & \text{else}
\end{cases}
\,\Bigr\}
\]

2.2 Conflict Generators

A conflict generator is a (condition, action)-pair, where the condition defines the tuples that are modified and the action describes the modification itself. Conditions are represented by closed patterns as defined in the following. The action is reflected by the mapping of values between predecessor and resulting database in the modified attribute.
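The construction of the predecessor and the value mapping that an action induces can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `predecessor` and `value_mapping`, the dictionary representation keyed by ID, the pairing of tuples from r1 and r2 by primary key, and the toy owl tuples (including the Sex attribute, the second species, and which color belongs to which tuple) are all hypothetical.

```python
from typing import Dict, Hashable, Set

Row = Dict[str, Hashable]            # one tuple: attribute name -> value
Relation = Dict[Hashable, Row]       # relation keyed by primary-key value


def predecessor(r1: Relation, r2: Relation, b: str, pk: str = "ID") -> Relation:
    """Build r_p: equal to r1 in every attribute except b, which is taken from r2.

    Assumption: tuples of r1 and r2 are paired by their primary-key value.
    """
    if b == pk:
        raise ValueError("the primary key is excluded from modification")
    if r1.keys() != r2.keys():
        raise ValueError("both databases must hold the same object identifiers")
    return {key: {**t1, b: r2[key][b]} for key, t1 in r1.items()}


def value_mapping(rp: Relation, r1: Relation, b: str) -> Dict[Hashable, Set[Hashable]]:
    """Map each value of b in r_p to the values it was changed to in r1."""
    mapping: Dict[Hashable, Set[Hashable]] = {}
    for key, tp in rp.items():
        if tp[b] != r1[key][b]:      # only modified values belong to the action
            mapping.setdefault(tp[b], set()).add(r1[key][b])
    return mapping


# Hypothetical toy data; the actual contents of Figure 1 are not reproduced here.
r1 = {
    1: {"ID": 1, "Species": "Nyctea Scandica", "Sex": "f", "Color": "white"},
    2: {"ID": 2, "Species": "Nyctea Scandica", "Sex": "m", "Color": "white"},
    3: {"ID": 3, "Species": "Strix aluco", "Sex": "f", "Color": "brown"},
}
r2 = {
    1: {"ID": 1, "Species": "Nyctea Scandica", "Sex": "f", "Color": "white & grey"},
    2: {"ID": 2, "Species": "Nyctea Scandica", "Sex": "m", "Color": "snow-white"},
    3: {"ID": 3, "Species": "Strix aluco", "Sex": "f", "Color": "brown"},
}

rp = predecessor(r1, r2, "Color")
color_mapping = value_mapping(rp, r1, "Color")
```

On this toy data, `rp` carries the finer-grained colors from r2 while keeping all other attributes of r1, and `color_mapping` sends both finer-grained colors to the single value 'white'.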
For example, the action of the update operation shown in Figure 1 results in a mapping of the values 'white & grey' and 'snow-white' to the value 'white'. We classify conflict generators based on the properties of the mapping they define.

Tuples are represented using terms τ : (A, x) that are (attribute, value)-pairs, with A ∈ R and x ∈ dom(A). Let terms(t) denote the set of terms for a tuple t. For each attribute A ∈ R there exists a term (A, t[A]) ∈ terms(t). A pattern ρ is a set of terms, i.e., ρ ⊆ ⋃_{t∈r} terms(t). A tuple t satisfies a pattern ρ if ρ ⊆ terms(t). The empty pattern is satisfied by any tuple. The set of tuples from r that satisfy ρ is denoted by ρ(r). We call |ρ(r)| the support of the pattern.

A pattern ρ is called a closed pattern if there does not exist a proper superset ρ′ ⊃ ρ with ρ′(r) = ρ(r). We focus solely on closed patterns as conditions for conflict generators. The set of closed patterns is smaller in size than the set of patterns. Still, closed patterns are lossless in the sense that they uniquely determine the set of all patterns and their sets of satisfied tuples (Zaki (2002)).

The Boolean function conflict_B(t) indicates for each tuple tp ∈ rp whether contradicting values exist for attribute B in the corresponding tuples t1 ∈ r1 and t2 ∈ r2 with t1[PK] = t2[PK] = tp[PK]:
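The definitions of terms, patterns, support, and closedness can be made concrete with a short Python sketch. It is only an illustration: the helper names, the list-of-dictionaries representation, and the toy tuples are hypothetical, the closedness check is a brute-force test rather than a mining algorithm, and the body of `conflict` reflects one plausible reading of the conflict function (true exactly when t1[B] and t2[B] differ), since its defining formula is not included in the text above.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

Row = Dict[str, Hashable]
Term = Tuple[str, Hashable]          # an (attribute, value)-pair


def terms(t: Row) -> FrozenSet[Term]:
    """All (attribute, value)-pairs of a tuple."""
    return frozenset(t.items())


def satisfying(pattern: FrozenSet[Term], r: List[Row]) -> List[Row]:
    """The tuples of r that satisfy the pattern, i.e. pattern is a subset of terms(t)."""
    return [t for t in r if pattern <= terms(t)]


def support(pattern: FrozenSet[Term], r: List[Row]) -> int:
    return len(satisfying(pattern, r))


def is_closed(pattern: FrozenSet[Term], r: List[Row]) -> bool:
    """Brute-force check: no single extra term leaves the satisfied tuples unchanged.

    Testing single-term extensions is enough, because any larger superset with
    the same satisfied tuples contains such a term.
    """
    universe = frozenset().union(*(terms(t) for t in r))
    matched = satisfying(pattern, r)
    return not any(
        satisfying(pattern | {extra}, r) == matched
        for extra in universe - pattern
    )


def conflict(tp: Row, r1: List[Row], r2: List[Row], b: str, pk: str = "ID") -> bool:
    """Assumed reading: True iff the tuples matching tp's key differ in attribute b."""
    t1 = next(t for t in r1 if t[pk] == tp[pk])
    t2 = next(t for t in r2 if t[pk] == tp[pk])
    return t1[b] != t2[b]


# Hypothetical predecessor for attribute Color (toy data, not taken from Figure 1).
rp = [
    {"ID": 1, "Species": "Nyctea Scandica", "Sex": "f", "Color": "white & grey"},
    {"ID": 2, "Species": "Nyctea Scandica", "Sex": "m", "Color": "snow-white"},
    {"ID": 3, "Species": "Strix aluco", "Sex": "f", "Color": "brown"},
]
# In this toy case r2 agrees with rp everywhere; r1 generalizes snowy-owl colors.
r2 = [dict(t) for t in rp]
r1 = [dict(t, Color="white") if t["Species"] == "Nyctea Scandica" else dict(t) for t in rp]

snowy = frozenset({("Species", "Nyctea Scandica")})
male = frozenset({("Sex", "m")})
conflicts = [conflict(tp, r1, r2, "Color") for tp in rp]
```

In this toy relation the pattern `snowy` has support 2 and is closed, whereas `male` is not closed because adding the term (ID, 2) selects the same single tuple.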