SemGen - Towards a Semantic Data Generator for Benchmarking Duplicate Detectors

Authors

  • Wolfgang Gottesheim
  • Stefan Mitsch
  • Werner Retschitzegger
  • Wieland Schwinger
  • Norbert Baumgartner
Abstract

Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and variability cannot be configured sufficiently. In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we propose how duplicate semantics can be defined for the domain of road traffic management. A discussion of lessons learned concludes the paper.
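
The two-step idea in the abstract (qualitative diversification first, quantitative value generation second) can be illustrated with a minimal sketch. This is not SemGen's implementation: the qualitative classes, their numeric ranges, the `DUPLICATE_OF` table, and all function names are hypothetical and chosen only for illustration, loosely following the road traffic setting.

```python
import random

# Hypothetical qualitative distance classes and the numeric ranges (metres)
# they stand for. These values are assumptions, not taken from the paper.
RANGES = {"very_close": (0, 50), "close": (50, 200), "far": (200, 1000)}

# Hypothetical duplicate semantics: qualitative values that may describe
# the same real-world object.
DUPLICATE_OF = {
    "very_close": ["very_close", "close"],
    "close": ["very_close", "close"],
    "far": ["far"],
}


def diversify(label, rng):
    """Step 1: pick a qualitative variant that still counts as a duplicate."""
    return rng.choice(DUPLICATE_OF[label])


def quantify(label, rng):
    """Step 2: draw a numeric value inside the range of the qualitative label."""
    low, high = RANGES[label]
    return rng.uniform(low, high)


def generate(seed_objects, copies, seed=0):
    """Return generated records and the index pairs known to be duplicates."""
    rng = random.Random(seed)
    records, duplicate_pairs = [], []
    for obj_id, label in seed_objects:
        original = len(records)
        records.append((obj_id, label, quantify(label, rng)))
        for _ in range(copies):
            variant = diversify(label, rng)
            duplicate_pairs.append((original, len(records)))
            records.append((obj_id, variant, quantify(variant, rng)))
    return records, duplicate_pairs
```

Calling `generate([("jam-1", "close"), ("jam-2", "far")], copies=2)` yields six records and four duplicate pairs. Because the pairs are recorded while the data is generated, the ground truth a benchmark needs is available by construction, and the variability is configured in the qualitative table rather than in numeric noise parameters.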


Similar resources

Generating Synthetic RDF Data with Connected Blank Nodes for Benchmarking

Generators for synthetic RDF datasets are very important for testing and benchmarking various semantic data management tasks (e.g. querying, storage, update, compare, integrate). However, the current generators do not support sufficiently (or totally ignore) blank node connectivity issues. Blank nodes are used for various purposes (e.g. for describing complex attributes), and a significant perc...


Pushing the Limits of Instance Matching Systems: A Semantics-Aware Benchmark for Linked Data

The architectural choices behind the Data Web have led to the publication of large interrelated data sets that contain different descriptions for the same real-world objects. Due to the mere size of current online datasets, such duplicate instances are most commonly detected (semi-)automatically using instance matching frameworks. Choosing the right framework for this purpose remains tedious, a...


Clone Detection by Comparing Abstract Memory States

In this paper, we propose a new semantic clone detection technique by comparing programs’ abstract memory states, which are computed by a semantic-based static analyzer. Our experimental study using three large-scale open source projects shows that our technique can detect semantic clones that existing syntactic- or semantic-based clone detectors miss. Our technique can help developers identify i...


Benchmarking RDF Query Engines: The LDBC Semantic Publishing Benchmark

The Linked Data paradigm, which is now the prominent enabler for sharing huge volumes of data by means of Semantic Web technologies, has created novel challenges for non-relational data management technologies such as RDF and graph database systems. Benchmarking, which is an important factor in the development of research on RDF and graph data management technologies, must address these challeng...


Towards constructing an Integrative, Multi-Level Model for Cognition: The Function of Semantic Networks

Integrated approaches try to connect different constructs in different theories and reinterpret them using a common conceptual framework. In this research, using the concept of processing levels, an integrated, three-level model of the cognitive systems has been proposed and evaluated. Processing levels are divided into three categories of Feature-Oriented, Semantic and Conceptual Level based o...


Publication date: 2011