What can FCA do for database linkkey extraction? (problem paper)

Authors

  • Manuel Atencia
  • Jérôme David
  • Jérôme Euzenat
Abstract

Links between heterogeneous data sets may be found by using a generalisation of keys in databases, called linkkeys, which apply across data sets. This paper considers the question of characterising such keys in terms of formal concept analysis. This question is natural because the space of candidate keys is an ordered structure obtained by reduction of the space of keys and that of data set partitions. Classical techniques for generating functional dependencies in formal concept analysis indeed apply for finding candidate keys. They can be adapted in order to find database candidate linkkeys. The question of their extensibility to the RDF context would be worth investigating.

We aim at finding correspondences between properties of two RDF datasets which allow for identifying items denoting the same individuals. This is particularly useful when dealing with linked data [8] for finding equality links between data sets. Because the RDF setting raises many additional problems, we restrict ourselves here to databases.

The problem is illustrated by the two (small) book relations of Table 1 (from [5], p.116). We would like to characterise a way to identify items on the same line while not (wrongly) identifying any other pair of items.

bookstore relation                          | library relation
id  firstname  title        lastname  lang  | year  author   orig         translator  wid
id  fn         tt           ln        lg    | y     a        o            t           w
                                            | 1845  Poe      Raven        Baudelaire  a1
                                            | 1845  Poe      Raven        Mallarmé    a2
3   E.         Gold bug     Poe       en    | 1843  Poe      Gold Bug     Baudelaire  b
4   T.         On murder    Quincey   en    | 1827  Quincey  On murder    Schwob      c
5   T.         Kant         Quincey   en    | 1827  Quincey  Kant         Schwob      d
6   T.         Confessions  Quincey   en    | 1822  Quincey  Confessions  Musset      e
7   J.-J.      Confessions  Rousseau  fr    |
8   T.         Confessions  Aquinus   fr    |

Table 1. Two relations with, on the same lines, those tuples that represent the same individual (the line after the attribute names gives their abbreviations).

For that purpose, we have defined linkkeys [5, 2] and we would like to formulate the linkkey extraction problem in the framework of formal concept analysis [6]. We first present this problem in the context of database candidate key extraction, where one looks for sets of attributes and the sets of equality statements that they generate. We formulate this problem as the computation of a concept lattice. Then we turn to an adaptation of linkkeys to databases and show that the previous technique cannot be used for extracting the expected linkkeys. Instead we propose an adaptation.

1 Candidate keys in databases

A relation $D = \langle A, T \rangle$ is a set of tuples $T$ characterised by a set of attributes $A$. A key is a subset of the attributes whose values identify a unique tuple.

Definition (key) Given a database relation $D = \langle A, T \rangle$, a key is a subset of the attributes $K \subseteq A$ such that $\forall t, t' \in T, (\forall p \in K, p(t) = p(t')) \Rightarrow t \approx t'$.

Classically, keys are defined from functional dependencies. A set of attributes $A$ is functionally dependent on another set $K$ if equality of the attributes of $K$ determines equality of the attributes of $A$. If equality between tuples is the same thing as equality of all attribute values, then a key is simply a set of attributes on which $A$ is functionally dependent.

However, we have not used the equality between tuples ($=$) but a particular relation $\approx$. The reason is that we do not want to find keys for the database with $=$, but with an unknown relation $\approx$ which is to be discovered. The statements $t \approx t'$ are those equality statements that are generated by the key. The $\approx$ relation must contain $=$ ($t = t' \Rightarrow t \approx t'$) and be an equivalence relation (this is by definition if it is the smallest relation satisfying the key).

From a key $K$ of a relation $\langle A, T \rangle$, it is easy to obtain these statements through the function $\gamma : 2^A \to 2^{T \times T}$ such that $\gamma(K) = \{t \approx t' \mid \forall p \in K, p(t) = p(t')\}$. $\gamma$ is anti-monotonic: $\forall K, K' \subseteq A, K \subseteq K' \Rightarrow \gamma(K) \supseteq \gamma(K')$.

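The function γ can be illustrated on the bookstore relation of Table 1. The following is a minimal sketch and not the authors' implementation: the names `BOOKSTORE` and `gamma` are hypothetical, tuples are represented as Python dictionaries, and, as a simplifying assumption, only unordered pairs of distinct tuples are returned (the reflexive and symmetric statements are left implicit).

```python
from itertools import combinations

# Bookstore relation of Table 1, using the abbreviated attribute names.
BOOKSTORE = [
    {"id": 3, "fn": "E.", "tt": "Gold bug", "ln": "Poe", "lg": "en"},
    {"id": 4, "fn": "T.", "tt": "On murder", "ln": "Quincey", "lg": "en"},
    {"id": 5, "fn": "T.", "tt": "Kant", "ln": "Quincey", "lg": "en"},
    {"id": 6, "fn": "T.", "tt": "Confessions", "ln": "Quincey", "lg": "en"},
    {"id": 7, "fn": "J.-J.", "tt": "Confessions", "ln": "Rousseau", "lg": "fr"},
    {"id": 8, "fn": "T.", "tt": "Confessions", "ln": "Aquinus", "lg": "fr"},
]


def gamma(tuples, key):
    """Pairs (i, j), i < j, of tuple positions that agree on every attribute of `key`.

    Assumption of this sketch: pairs of distinct tuples only; t ≈ t is implicit.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(tuples)), 2)
        if all(tuples[i][p] == tuples[j][p] for p in key)
    }


if __name__ == "__main__":
    # Statements generated by the title alone, then by title and last name.
    print(sorted(gamma(BOOKSTORE, {"tt"})))
    print(sorted(gamma(BOOKSTORE, {"tt", "ln"})))
```

In this sketch the title alone equates the three "Confessions" tuples, while adding the last name separates them, which is one instance of the anti-monotonicity of γ.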
We define candidate key extraction as the task of finding the minimal sets of attributes which generate a partition of the set of tuples.

Definition (candidate key) Given a database relation $D$, a candidate key is a key such that none of its proper subsets generates the same partition. $\kappa(D)$ is the set of candidate keys. Those candidate keys which generate the $\mathit{singletons}(T)$ partition are called normal candidate keys and their set is noted $\hat{\kappa}(D) = \{K \in \kappa(D) \mid \forall (t \approx t') \in \gamma(K), t = t'\}$.

The problem of candidate key extraction is formulated in the following way:

Problem: Given a database relation $D$, find $\kappa(D)$.

This problem is usually not considered in databases. Either keys are given and used for finding equivalent tuples and reducing the table, or the table is assumed to be without redundancy and keys are extracted. In this latter case, the problem is the extraction of normal candidate keys.

Using lattices is commonplace for extracting functional dependencies [9, 4], and the link between functional dependency extraction and formal concept analysis has already been considered [6] and further refined [10, 3]. In fact, this link can be fully exploited for extracting candidate keys instead of finding functional dependencies. It consists of defining a formal context $\mathit{enc}(\langle A, T \rangle) = \langle P_2(T), A, I \rangle$ such that:

$\forall p \in A, \forall \langle t, t' \rangle \in P_2(T), \langle t, t' \rangle \, I \, p$ iff $p(t) = p(t')$

The (formal) concepts of this encoding, which we denote by the set $\mathit{FCA}(\mathit{enc}(D))$, associate a set of attributes to a set of pairs of tuples. These pairs of tuples are tuples that cannot be distinguished by the values of the attributes, i.e., our $\approx$ assertions. The candidate keys are the minimal subsets of the intent which generate exactly the corresponding partition:

$\kappa(D) = \bigcup_{c \in \mathit{FCA}(\mathit{enc}(D))} \mu_{\subseteq} \{K \subseteq \mathit{intent}(c) \mid \gamma(K) = \gamma(\mathit{intent}(c))\}$

For any key $K \in \kappa(D)$, $\gamma(K)$ is the reflexive, transitive and symmetric closure of the extent of its concept. If this method is applied to the data sources of Table 1, the result is displayed in Figure 1.
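The encoding and the formula for κ(D) can be sketched in a few lines of Python. This is an illustrative brute-force sketch under stated assumptions, not the authors' algorithm: the helper names (`gamma`, `intents`, `candidate_keys`) are hypothetical, only the bookstore relation of Table 1 is used, pairs are unordered pairs of distinct tuples, and the intents are obtained by closing the per-pair attribute sets under intersection rather than by a dedicated FCA library.

```python
from itertools import combinations

ATTRS = ("id", "fn", "tt", "ln", "lg")
ROWS = [
    (3, "E.", "Gold bug", "Poe", "en"),
    (4, "T.", "On murder", "Quincey", "en"),
    (5, "T.", "Kant", "Quincey", "en"),
    (6, "T.", "Confessions", "Quincey", "en"),
    (7, "J.-J.", "Confessions", "Rousseau", "fr"),
    (8, "T.", "Confessions", "Aquinus", "fr"),
]
TUPLES = [dict(zip(ATTRS, row)) for row in ROWS]


def gamma(tuples, key):
    """Unordered pairs of distinct tuples that agree on every attribute of `key`."""
    return {
        (i, j)
        for i, j in combinations(range(len(tuples)), 2)
        if all(tuples[i][p] == tuples[j][p] for p in key)
    }


def intents(tuples, attrs):
    """Intents of the context enc(D): objects are pairs of tuples, and a pair
    has attribute p iff both tuples share the same value for p."""
    pair_rows = [
        frozenset(p for p in attrs if tuples[i][p] == tuples[j][p])
        for i, j in combinations(range(len(tuples)), 2)
    ]
    # The intents are the full attribute set plus all intersections of rows.
    closed = {frozenset(attrs)}
    for row in pair_rows:
        closed |= {row & c for c in closed}
    return closed


def candidate_keys(tuples, attrs):
    """Union, over all intents, of the minimal subsets generating the same pairs."""
    keys = set()
    for intent in intents(tuples, attrs):
        target = gamma(tuples, intent)
        minimal = []
        # Increasing size guarantees that minimal generators are met first.
        for size in range(len(intent) + 1):
            for subset in combinations(sorted(intent), size):
                k = frozenset(subset)
                if gamma(tuples, k) == target and not any(g < k for g in minimal):
                    minimal.append(k)
        keys.update(minimal)
    return keys


if __name__ == "__main__":
    keys = candidate_keys(TUPLES, ATTRS)
    # Normal candidate keys: those that equate no two distinct tuples.
    for k in sorted((sorted(k) for k in keys if not gamma(TUPLES, k)), key=len):
        print(k)
```

The enumeration of subsets is exponential in the size of each intent, which is acceptable for a five-attribute example but would need a proper minimal-generator algorithm on larger relations.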


Similar resources

FCA Software Interoperability

This paper discusses FCA software interoperability from a variety of angles: because the central FCA structures, formal contexts and concept lattices, can be represented in non-FCA software, interoperability with such software is of relevance. The non-FCA software in question is spreadsheet, relational database, graph and vector graphics software. The simplest approach to interoperability consi...


An FCA Classification of Durations of Time for Textual Databases

Formal Concept Analysis (FCA) is useful in many applications, not least in data analysis. In this paper, we apply the FCA approach to the problem of classifying sets of sets of durations of time, for the purposes of storing them in a database. The database system in question is, in fact, an object-oriented text database system, in which all objects are seen as arbitrary sets of integers. These ...


Towards an FCA-based Recommender System for Black-Box Optimization

Black-box optimization problems are of practical importance throughout science and engineering. Hundreds of algorithms and heuristics have been developed to solve them. However, none of them outperforms any other on all problems. The success of a particular heuristic is always relative to a class of problems. So far, these problem classes are elusive and it is not known what algorithm to use on...


Improving LNMF Performance of Facial Expression Recognition via Significant Parts Extraction using Shapley Value

Nonnegative Matrix Factorization (NMF) algorithms have been utilized in a wide range of real applications. NMF is used by several researchers owing to its part-based representation property, especially in the facial expression recognition problem. It decomposes a face image into its essential parts (e.g. nose, lips, etc.) but in all previous attempts, it is neglected that all features achieved by NMF ...



Publication date: 2014