Private Exploration Primitives for Data Cleaning
Abstract
Data cleaning is the process of detecting and repairing inaccurate or corrupt records in data. It is inherently human-driven, and state-of-the-art systems assume that cleaning experts can access the data to tune the cleaning process. However, for sensitive datasets such as electronic medical records, privacy constraints disallow unfettered access to the data. To address this challenge, we propose a utility-aware differentially private framework that allows a data cleaner to query the private data for a given cleaning task, while the data owner tracks the privacy loss over these queries. In this paper, we first identify a set of primitives based on counting queries for general data cleaning tasks and show that, even with some errors, these cleaning tasks can be completed with reasonably good quality. We also design a privacy engine that translates the per-query accuracy requirement specified by the data cleaner into a differential privacy loss parameter and ensures that all queries are answered under differential privacy. Through extensive experiments using blocking and matching as examples, we demonstrate that our approach achieves plausible cleaning quality and outperforms prior approaches to cleaning private data.
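The idea of translating a per-query accuracy requirement into a privacy loss parameter can be illustrated with a minimal sketch. It assumes the standard Laplace mechanism for a counting query (sensitivity 1) and simple sequential composition of the privacy loss; the abstract does not say which mechanism or accounting the paper's engine uses. The names `epsilon_for_accuracy`, `laplace_noise`, and `PrivacyEngine` are hypothetical and chosen only for this illustration. For Laplace noise with scale 1/ε, the probability that the absolute error exceeds α is exp(-αε), so requiring that probability to be at most β gives ε = ln(1/β)/α.

```python
import math
import random


def epsilon_for_accuracy(alpha: float, beta: float, sensitivity: float = 1.0) -> float:
    """Smallest epsilon such that Laplace noise of scale sensitivity/epsilon
    stays within +/- alpha with probability at least 1 - beta."""
    if alpha <= 0 or not 0 < beta < 1:
        raise ValueError("need alpha > 0 and 0 < beta < 1")
    return sensitivity * math.log(1.0 / beta) / alpha


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class PrivacyEngine:
    """Hypothetical engine: answers counting queries with Laplace noise and
    tracks cumulative privacy loss by summing epsilons (sequential composition)."""

    def __init__(self, records, budget: float, seed=None):
        self.records = list(records)
        self.budget = budget
        self.spent = 0.0
        self.rng = random.Random(seed)

    def count(self, predicate, alpha: float, beta: float) -> float:
        eps = epsilon_for_accuracy(alpha, beta)
        if self.spent + eps > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += eps
        true_count = sum(1 for r in self.records if predicate(r))
        return true_count + laplace_noise(1.0 / eps, self.rng)


if __name__ == "__main__":
    # Made-up toy records, used only to exercise the sketch.
    rows = [{"age": a} for a in range(100)]
    engine = PrivacyEngine(rows, budget=1.0, seed=0)
    noisy = engine.count(lambda r: r["age"] >= 50, alpha=10, beta=0.05)
    print(f"noisy count: {noisy:.1f}, privacy loss so far: {engine.spent:.3f}")
```

In this sketch, asking for an error of at most 10 with 95% probability (α = 10, β = 0.05) costs ε = ln(20)/10 ≈ 0.30, so a total budget of 1.0 admits three such queries before the engine refuses to answer.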
Similar resources
Towards a Domain Independent Platform for Data Cleaning
We present a domain independent platform for data cleaning developed as part of the Data Cleaning project at Microsoft Research. Our platform consists of a set of core primitives and design tools that allow a programmer to develop sophisticated data cleaning solutions with minimal programming effort. Our primitives are designed to allow rich domain and application specific customizations and ca...
Declarative Cleaning, Analysis, and Querying of Graph-structured Data
Title of dissertation: DECLARATIVE CLEANING, ANALYSIS, AND QUERYING OF GRAPH-STRUCTURED DATA Walaa Eldin Moustafa, Doctor of Philosophy, 2013 Dissertation directed by: Professor Amol Deshpande, Professor Lise Getoor, Department of Computer Science Much of today’s data including social, biological, sensor, computer, and transportation network data is naturally modeled and represented by graphs. ...
An Exploration of Teachers' Beliefs about the Role of Grammar in Iranian High Schools and Private Language Institutes
This study was an attempt to explore the beliefs of Iranian EFL teachers about the role of grammar in English language teaching in both state schools and private language institutes. Data were collected through a questionnaire developed by Burgess and Etherington (2002), which consisted of 11 main subscales and was divided into two sections. The first section dealt with approaches to grammar te...
Parleda: a Library for Parallel Processing in Computational Geometry Applications
ParLeda is a software library that provides the basic primitives needed for parallel implementation of computational geometry applications. It can also be used in implementing a parallel application that uses geometric data structures. The parallel model that we use is based on a new heterogeneous parallel model named HBSP, which is based on BSP and is introduced here. ParLeda uses two main lib...
Lightweight 4x4 MDS Matrices for Hardware-Oriented Cryptographic Primitives
Linear diffusion layer is an important part of lightweight block ciphers and hash functions. This paper presents an efficient class of lightweight 4x4 MDS matrices such that the implementation cost of them and their corresponding inverses are equal. The main target of the paper is hardware oriented cryptographic primitives and the implementation cost is measured in terms of the required number ...
Journal title:
- CoRR
Volume: abs/1712.10266
Publication date: 2017