SLINT+ results for OAEI 2013 instance matching
Authors
Abstract
The goal of instance matching is to detect identity resources, which refer to the same real-world object. In this paper, we introduce SLINT+, a novel interlinking system. SLINT+ detects all identity linked data resources between two given repositories. SLINT+ requires neither specifications of RDF predicates nor labeled matching resources. SLINT+ performs competitively in this year's OAEI instance matching campaign.

1 Presentation of the system

The problem of detecting instances that co-refer to the same real-world object is important in data integration. Solving it helps reduce the heterogeneity of data and ensure its consistency. The asynchronous development of linked data increasingly requires dedicated linked data instance matching algorithms to connect existing instances with newly added ones. In linked data, differences in data representation appear not only in object values, but also in RDF predicates, which specify the meaning of object properties. This issue is commonly found across different data repositories, or even across different domains of the same repository. It also divides current solutions into two groups: schema-dependent and schema-independent. The first approach uses the descriptions of RDF predicates as the guide for matching, while the second does not. In addition, the schema-independent approach has two branches, supervised learning and unsupervised learning, which respectively use or do not use labeled data of identity instances. We develop SLINT [3] and its extension SLINT+ [2]. These systems are schema-independent and training-free. SLINT+ is used for the instance matching track of OAEI 2013.

1.1 State, purpose, general statement

SLINT+ is a flexible schema-independent linked data interlinking system. SLINT+ can interlink various data sources, and is independent of the schema of the data sources.
By detecting appropriate predicate alignments without supervised learning, SLINT+ does not require expensive curation of data examples. The principle of SLINT+ is similar to that of previous data interlinking systems. There are two main phases in the interlinking process of SLINT+: candidate generation and instance matching. The first phase groups similar instances in order to reduce the number of pending pairs. The second phase determines which candidates are really identity instances. To reach the schema-independent goal, we add two steps to SLINT+: predicate selection and predicate alignment. The mission of these new steps is to find the predicate alignments that specify the same properties of instances.

Fig. 1. The interlinking process of SLINT+

1.2 Specific techniques used

The architecture of SLINT+ is depicted in Fig. 1. $D_S$ and $D_T$ are the source and target data. The predicate selection step finds the important predicates using statistical measures over the RDF objects associated with each predicate. The predicate alignment step matches important predicates and selects the reasonable alignments. This step can be regarded as an instance-based ontology matching task. The candidate generation step picks up similar instances, which are predicted to be identity instances. The final step, instance matching, compares the suggested candidates and produces the interlinking result. In the following sections, we describe each step in the order of the process.

1.2.1 Predicate selection

This is the first step of the interlinking process. It collects the important predicates of each input data source. Important predicates are expected to be used by a large portion of instances and to store information specific to each instance. Thus, an important predicate should have high frequency and diverse RDF objects. We use coverage and discriminability as the metrics to evaluate the importance level of each predicate. These metrics are extensions of those in [4].
Equations (1) and (2) give the coverage $\mathrm{Cov}(p_k)$ of predicate $p_k$ and its discriminability $\mathrm{Dis}(p_k)$, respectively. The notation $\langle s, p, o \rangle$ stands for the subject, predicate, and object of an RDF triple; $x$, $D$, and $f$ denote an instance, the data source, and the frequency of an RDF object, respectively.

$$\mathrm{Cov}(p_k) = \frac{|\{x \mid x \in D,\ \exists t = \langle s, p_k, o \rangle \in x\}|}{|D|} \tag{1}$$

$$
\begin{aligned}
\mathrm{Dis}(p_k) &= \frac{\mathrm{Var}(p_k) \times H(p_k)}{\mathrm{Var}(p_k) + H(p_k)} \\
\mathrm{Var}(p_k) &= \frac{|O_{p_k}|}{|\{t \mid \exists x \in D,\ t = \langle s, p_k, o \rangle \in x\}|} \\
H(p_k) &= -\sum_{o_i \in O_{p_k}} \frac{f(o_i)}{\sum_{o_j \in O_{p_k}} f(o_j)} \times \log \frac{f(o_i)}{\sum_{o_j \in O_{p_k}} f(o_j)} \\
O_{p_k} &= \{o \mid \exists x \in D,\ t = \langle s, p_k, o \rangle \in x\}
\end{aligned}
\tag{2}
$$

A predicate is important if its coverage, its discriminability, and their harmonic mean are greater than the given thresholds $\alpha$, $\beta$, and $\gamma$, respectively. We select two sets of important predicates, one from each input data source. In the next step, we align these sets and find the useful predicate alignments.

1.2.2 Predicate alignment

In this step, we first group the predicates by type. The type of a predicate is determined by the dominant type of its RDF objects. There are five predicate types used in SLINT+: string, URI, double, integer, and date. Second, we combine the predicates of the same type from the source and target data to get raw predicate alignments. Confidence is the evidence for evaluating the usefulness of raw alignments. The confidence is estimated from the overlap of the RDF objects described by the two predicates of each alignment. Equation (3) describes the confidence of the alignment between predicates $p_i$ and $p_j$.

$$
\begin{aligned}
\mathrm{conf}(p_i, p_j) &= \frac{|R(O_{p_i}) \cap R(O_{p_j})|}{|R(O_{p_i}) \cup R(O_{p_j})|} \\
O_{p_i} &= \{o \mid \exists x \in D_S,\ \langle s, p_i, o \rangle \in x\} \\
O_{p_j} &= \{o \mid \exists x \in D_T,\ \langle s, p_j, o \rangle \in x\}
\end{aligned}
\tag{3}
$$

Through the function $R$, string, URI, and double values are compared indirectly. For strings and URIs, $R$ collects the lexical words from the given texts and links. For doubles, $R$ rounds the values to two decimal places. For the remaining types, $R$ uses the original values without transformation. Only useful alignments, whose confidence is greater than a threshold, are kept for the next steps.
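As an illustration of Equation (3), the following minimal Python sketch computes the confidence of a raw alignment and keeps the alignments whose confidence exceeds the mean confidence of the non-trivial ones. It is not the SLINT+ implementation: the function names, the regular-expression tokenizer with lowercasing used for $R$ on strings and URIs, and the cutoff `eps=0.01` for non-trivial alignments are all assumptions made for this example.

```python
import re


def r_transform(values, ptype):
    """Sketch of R: words for string/URI, 2-decimal rounding for double,
    identity for other types. Tokenizer and lowercasing are assumptions."""
    out = set()
    for v in values:
        if ptype in ("string", "uri"):
            out.update(w.lower() for w in re.findall(r"[A-Za-z0-9]+", str(v)))
        elif ptype == "double":
            out.add(round(float(v), 2))
        else:
            out.add(v)
    return out


def confidence(objs_src, objs_tgt, ptype):
    """Equation (3): overlap of transformed object sets over their union."""
    a = r_transform(objs_src, ptype)
    b = r_transform(objs_tgt, ptype)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_useful(conf_by_pair, eps=0.01):
    """Keep alignments above the mean confidence of non-trivial alignments.
    eps (the cutoff for 'non-trivial') is a hypothetical value."""
    nontrivial = [c for c in conf_by_pair.values() if c > eps]
    if not nontrivial:
        return {}
    threshold = sum(nontrivial) / len(nontrivial)
    return {pair: c for pair, c in conf_by_pair.items() if c > threshold}
```

For example, `confidence(["Tokyo Tower", "Kyoto"], ["tokyo", "Osaka Castle"], "string")` compares the word sets {tokyo, tower, kyoto} and {tokyo, osaka, castle}, which share one of five distinct words.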
The threshold for useful alignments is computed by averaging the confidence of the non-trivial alignments. An alignment is considered non-trivial if its confidence is higher than a small threshold value. In the next steps, the useful alignments are used as the specification for comparing instances.

1.2.3 Candidate generation

The goal of candidate generation is to limit the number of instances to be compared. SLINT+ performs a very fast comparison for each pair of instances. The result of this comparison is a rough similarity between the instances. It is computed from their shared RDF objects, without considering individual predicate alignments. That is, two compared RDF objects may belong to two predicates for which no alignment was selected. Equation (4) is the rough similarity of instances $x_S$ and $x_T$. In this equation, $A$ is the set of filtered predicate alignments; $R$ is the preprocessing procedure, as used in the predicate alignment step; $w(O, S)$ and $w(O, T)$ are the weights of the shared value $O$ in data sources $D_S$ and $D_T$, respectively. The weight of string and URI values is estimated by the TF-IDF score, while that of the remaining types is fixed to 1.0.
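Returning to the predicate selection step of Section 1.2.1, the coverage and discriminability measures of Equations (1) and (2) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: each instance is modeled as a list of `(predicate, object)` pairs with the subject left implicit, the logarithm is taken as the natural logarithm (the text does not state the base), and the function names are hypothetical.

```python
import math
from collections import Counter


def coverage(data, pk):
    """Equation (1): share of instances that have at least one triple with pk."""
    if not data:
        return 0.0
    covered = sum(1 for inst in data if any(p == pk for p, _ in inst))
    return covered / len(data)


def discriminability(data, pk):
    """Equation (2): Var * H / (Var + H) over the objects of pk."""
    # f(o): frequency of each object value of pk across all instances
    freq = Counter(o for inst in data for p, o in inst if p == pk)
    total = sum(freq.values())  # number of triples that use pk
    if total == 0:
        return 0.0
    var = len(freq) / total  # distinct objects per triple
    # Entropy of the object distribution (natural log is an assumption)
    h = -sum((f / total) * math.log(f / total) for f in freq.values())
    return (var * h) / (var + h)


def is_important(data, pk, alpha, beta, gamma):
    """Coverage, discriminability, and their harmonic mean must exceed
    the thresholds alpha, beta, and gamma, respectively."""
    cov = coverage(data, pk)
    dis = discriminability(data, pk)
    if cov + dis == 0:
        return False
    harmonic = 2 * cov * dis / (cov + dis)
    return cov > alpha and dis > beta and harmonic > gamma
```

In this sketch, a predicate whose objects are all identical has zero entropy and therefore zero discriminability, regardless of how many instances use it, which matches the intuition that such a predicate stores no instance-specific information.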
Similar resources
OAEI 2016 results of AML
AgreementMakerLight (AML) is an automated ontology matching system based primarily on element-level matching and on the use of external resources as background knowledge. This paper describes its configuration for the OAEI 2016 competition and discusses its results. For this OAEI edition, we tackled instance matching for the first time, thus expanding the coverage of AML to all types of ontolog...
Results of AML in OAEI 2017
AgreementMakerLight (AML) is an automated ontology matching system that was developed with both extensibility and efficiency in mind. This paper describes its configuration for the OAEI 2017 competition and discusses its results. For this OAEI edition, we built upon the instance matching foundations we laid last year, and tackled the new Hobbit track and its new evaluation platform. AML was the...
Large scale instance matching via multiple indexes and candidate selection
Instance Matching aims to discover the linkage between different descriptions of real objects across heterogeneous data sources. With the rapid development of Semantic Web, especially of the linked data, automatically instance matching has been become the fundamental issue for ontological data sharing and integration. Instances in the ontologies are often in large scale, which contains millions...
InsMT+ results for OAEI 2015 instance matching
The InsMT+ is an improved version of InsMT system participated at OAEI 2014. The InsMT+ an automatic instance matching system which consists in identifying the instances that describe the same real-world objects. The InsMT+ applies different string-based matchers with a local filter. This is the second participation of our system and we have improved somehow the results obtained by the previous...
STRIM results for OAEI 2015 instance matching evaluation
The interest of instance matching grows everyday with the emergence of linked data. This task is very necessary to interlink semantically data together in order to be reused and shared. In this paper, we introduce STRIM, an automatic instance matching tool designed to identify the instances that describe the same real-world objects. The STRIM system participates for the first time at OAEI 2015 ...