Arnold: Declarative Crowd-Machine Data Integration

Authors

  • Shawn R. Jeffery
  • Liwen Sun
  • Matt DeLand
  • Nick Pendar
  • Rick Barber
  • Andrew Galdi

Abstract

The availability of rich data from sources such as the World Wide Web, social media, and sensor streams is giving rise to a range of applications that rely on a clean, consistent, and integrated database built over these sources. Human input, or crowd-sourcing, is an effective tool to help produce such high-quality data. It is infeasible, however, to involve humans at every step of the data cleaning process for all data. We have developed a declarative approach to data cleaning and integration that balances when and where to apply crowd-sourcing and machine computation using a new type of data independence that we term Labor Independence. Labor Independence divides the logical operations that should be performed on each record from the physical implementations of those operations. Using this layer of independence, the data cleaning process can choose the physical operator for each logical operation that yields the highest quality for the lowest cost. We introduce Arnold, a data cleaning and integration architecture that utilizes Labor Independence to efficiently clean and integrate large amounts of data.
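
The separation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how Arnold scores or selects operators, so everything below is an assumption made for illustration: the `PhysicalOperator` type, the quality and cost estimates, the `choose_operator` function, and the simple utility (expected quality minus weighted cost) are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass(frozen=True)
class PhysicalOperator:
    """One concrete way to carry out a logical operation (hypothetical type)."""
    name: str
    run: Callable[[Record], Record]
    expected_quality: float  # assumed estimate in [0, 1]
    cost: float              # assumed cost per record, arbitrary units


def choose_operator(candidates: List[PhysicalOperator],
                    cost_weight: float) -> PhysicalOperator:
    """Pick the candidate with the best quality/cost trade-off.

    The utility used here (quality minus weighted cost) is an assumption;
    the abstract only says the choice balances quality against cost.
    """
    return max(candidates,
               key=lambda op: op.expected_quality - cost_weight * op.cost)


def machine_normalize(record: Record) -> Record:
    # Cheap automated implementation of the logical operation.
    return {**record, "name": record["name"].strip().title()}


def crowd_normalize(record: Record) -> Record:
    # Stand-in for a crowd task; a real system would post the record
    # to human workers and wait for their answer.
    return {**record, "name": record["name"].strip().title(),
            "verified_by": "crowd"}


# One logical operation ("normalize name") with two physical implementations.
NORMALIZE_NAME = [
    PhysicalOperator("machine", machine_normalize, 0.80, 0.001),
    PhysicalOperator("crowd", crowd_normalize, 0.95, 0.10),
]

record = {"name": "  acme diner "}
cheap = choose_operator(NORMALIZE_NAME, cost_weight=5.0)    # cost matters a lot
careful = choose_operator(NORMALIZE_NAME, cost_weight=0.5)  # quality matters more
cleaned = cheap.run(record)
```

With a high cost weight the sketch selects the machine operator, and with a low one it selects the crowd operator. The caller only names the logical operation, which is the point of the independence layer.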

Similar resources

Human-in-the-loop Data Integration

Data integration aims to integrate data in different sources and provide users with a unified view. However, data integration cannot be completely addressed by purely automated methods. We propose a hybrid human-machine data integration framework that harnesses human ability to address this problem, and apply it initially to the problem of entity matching. The framework first uses rule-based al...

December: A Declarative Tool for Crowd Member Selection

Adequate crowd selection is an important factor in the success of crowdsourcing platforms, increasing the quality and relevance of crowd answers and their performance in different tasks. The optimal crowd selection can greatly vary depending on properties of the crowd and of the task. To this end, we present December, a declarative platform with novel capabilities for flexible crowd selection. ...

SWISH: An Integrated Semantic Web Notebook

SPARQL editors like Yasgui [6] make it easier to write queries and inspect their results. Notebooks like Jupyter/IPython [5] already support computer and data scientists in domains like statistics and machine learning. There is currently no integrated notebook solution for Semantic Web programming that combines the strengths of SPARQL editors with the benefits of notebooks. The challenge is that Sem...

The JUMP-machine: A Generic Basis for the Integration of Declarative Paradigms

Implementation techniques for functional languages on the one hand and for logic languages on the other hand differ considerably. This complicates the development of flexible and efficient abstract machines and thus good compilers for multi-paradigm languages. The JUMP-machine is an abstract machine specifically designed for the efficient implementation of multi-paradigm languages. It consists of a minimal c...

Game Aspect: An Approach to Separation of Concerns in Crowdsourced Data Management

In data-centric crowdsourcing, it is well known that the incentive structure connected to workers’ behavior greatly affects output data. This paper proposes using a declarative language to deal explicitly with both data computation and the incentive structure. In the language, computation is modeled as a set of Datalog-like rules, and the incentive structures for the crowd are modeled as ga...

Publication date: 2013