Information Loss in Continuous Hybrid Microdata: Subdomain-Level Probabilistic Measures

Authors

  • Josep Domingo-Ferrer
  • Josep Maria Mateo-Sanz
  • Francesc Sebé
Abstract

The goal of privacy protection in statistical databases is to balance the social right to know and the individual right to privacy. When microdata (i.e. data on individual respondents) are released, they should stay analytically useful but should be protected so that it cannot be decided whether a published record matches a specific individual. However, there is some uncertainty in the assessment of data utility, since the specific data uses of the released data cannot always be anticipated by the data protector. Also, there is uncertainty in assessing disclosure risk, because the data protector cannot foresee what will be the information context of potential intruders. Generating synthetic microdata is an alternative to the usual approach based on distorting the original data. The main advantage is that no original data are released, so no disclosure can happen. However, subdomains (i.e. subsets of records) of synthetic datasets do not resemble the corresponding subdomains of the original dataset. Hybrid microdata mixing original and synthetic microdata overcome this lack of analytical validity. We present a fast method for generating numerical hybrid microdata in a way that preserves attribute means, variances and covariances, as well as (to some extent) record similarity and subdomain analyses. We also overcome the uncertainty in assessing data utility by using newly defined probabilistic information loss measures.
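The moment-preservation property mentioned in the abstract can be illustrated with a small sketch. This is not the authors' procedure, which the abstract does not spell out; it is one standard linear-algebra construction, assuming at least two numerical attributes and more records than attributes. The function name `synth_matching_moments`, the `seed` parameter and the use of Gaussian noise are illustrative assumptions.

```python
import numpy as np

def synth_matching_moments(X, seed=None):
    """Return a random matrix with the same sample means and covariance as X.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    Assumes X has shape (n, d) with d >= 2 and n > d.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)

    # Start from unrelated random records and centre them.
    A = rng.standard_normal((n, d))
    A -= A.mean(axis=0)

    # Whiten A using the Cholesky factor of its own sample covariance,
    # so that the whitened matrix Z has identity sample covariance.
    L_A = np.linalg.cholesky(np.cov(A, rowvar=False))
    Z = np.linalg.solve(L_A, A.T).T

    # Recolour with the Cholesky factor of the original covariance
    # and restore the original attribute means.
    L_X = np.linalg.cholesky(cov)
    return Z @ L_X.T + mean
```

The output is built entirely from fresh random draws, so no original record is copied. A hybrid scheme in the abstract's sense would additionally combine such records with original ones, which this sketch does not attempt.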

Similar resources

Comparing SDC Methods for Microdata on the Basis of Information Loss and Disclosure Risk

We present in this paper the first empirical comparison of SDC methods for microdata which encompasses both continuous and categorical microdata. Based on re-identification experiments, we try to optimize the tradeoff between information loss and disclosure risk. First, relevant SDC methods for continuous and categorical microdata are identified. Then generic information loss measures (not targ...

STATISTICAL COMMISSION and ECONOMIC COMMISSION FOR EUROPE, CONFERENCE OF EUROPEAN STATISTICIANS; COMMISSION OF THE EUROPEAN COMMUNITIES, EUROSTAT

Abstract: We present in this paper the first empirical comparison of SDC methods for continuous microdata. Based on re-identification experiments, we try to optimize the tradeoff between information loss and disclosure risk. SDC methods compared include additive noise, distortion by probability distribution, microaggregation, resampling, rank swapping and the novel approach based on lossy compr...

LHS-Based Hybrid Microdata vs Rank Swapping and Microaggregation for Numeric Microdata Protection

In previous work by Domingo-Ferrer et al., rank swapping and multivariate microaggregation have been identified as well-performing masking methods for microdata protection. Recently, Dandekar et al. proposed using synthetic microdata in place of the original data, generated with the Latin hypercube sampling (LHS) technique. The LHS method focuses on mimicking univariate as well as multivariate s...

An evolutionary approach to enhance data privacy

Dissemination of data with sensitive information about individuals has an implicit risk of unauthorized disclosure. Perturbative masking methods propose the distortion of the original data sets before publication, tackling a difficult tradeoff between data utility (low information loss) and protection against disclosure (low disclosure risk). In this paper we describe how information loss and d...

Preserving Edits When Perturbing Microdata for Statistical Disclosure Control (Natalie Shlomo, Ton De Waal)

To protect individuals in microdata from the risk of re-identification, a general perturbative method called PRAM (the Post-Randomization Method) is sometimes used for masking records. This method adds “noise” to categorical variables by changing values of categories for a small number of records according to a prescribed probability matrix and a stochastic process based on the outcome of a ran...

Publication date: 2005