Sampling with Synthesis: A New Approach for Releasing Public Use Census Microdata
Authors
Abstract
Many statistical agencies disseminate samples of census microdata, i.e., data on individual records, to the public. Before releasing the microdata, agencies typically alter identifying or sensitive values to protect data subjects’ confidentiality, for example by coarsening, perturbing, or swapping data. These standard disclosure limitation techniques distort relationships and distributional features in the original data, especially when applied with high intensity. Furthermore, it can be difficult for analysts of the masked public use data to adjust inferences for the effects of the disclosure limitation. Motivated by these shortcomings, we propose an approach to census microdata dissemination called sampling with synthesis. The basic idea is to replace the identifying or sensitive values in the census with multiple imputations, and release samples from these multiply-imputed populations. We demonstrate that sampling with synthesis can improve the quality of public use data relative to sampling followed...

∗Jörg Drechsler is research scientist, Institute for Employment Research, Department for Statistical Methods, Regensburger Straße 104, 90478 Nürnberg, Germany; and Jerome P. Reiter is Mrs. Alexander F. Hehmeyer Associate Professor of Statistical Science, Duke University, Durham, NC 27708-0251. This research was supported by a grant from the National Science Foundation (NSF-MMS-0751671).
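The basic idea stated in the abstract (replace sensitive values in the census with multiple imputations, then release samples from the resulting populations) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `sampling_with_synthesis`, the toy variables `age_group` and `income`, the use of simple random sampling, and above all the synthesizer, which simply redraws each sensitive value from the observed values of records in the same category. A real application would draw from a fitted imputation model instead.

```python
import random
from collections import defaultdict


def sampling_with_synthesis(census, sensitive, predictor, m, n, seed=0):
    """Return m samples of size n, each drawn from its own synthetic population.

    Hypothetical illustration only: the synthesizer redraws the sensitive
    value from the empirical distribution within each predictor category.
    """
    rng = random.Random(seed)

    # Pool the observed sensitive values by predictor category.
    pools = defaultdict(list)
    for rec in census:
        pools[rec[predictor]].append(rec[sensitive])

    released = []
    for _ in range(m):
        # Build one synthetic population: every record keeps its predictor
        # value but receives a freshly drawn sensitive value.
        synthetic_population = [
            {**rec, sensitive: rng.choice(pools[rec[predictor]])}
            for rec in census
        ]
        # Release a simple random sample (without replacement) from it.
        released.append(rng.sample(synthetic_population, n))
    return released


# Toy census: 300 records, three age groups with non-overlapping incomes.
census = [
    {"age_group": i % 3, "income": 1000 * (i % 3) + i}
    for i in range(300)
]
samples = sampling_with_synthesis(census, "income", "age_group", m=5, n=50)
```

Each element of `samples` is one public-use file. An analyst would estimate the quantity of interest in each file and then combine the results across the m files, which this sketch does not attempt to reproduce.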
Similar resources
Bayesian Multiple Imputation for Large-Scale Categorical Data with Structural Zeros
We propose an approach for multiple imputation of items missing at random in large-scale surveys with exclusively categorical variables that have structural zeros. Our approach is to use mixtures of multinomial distributions as imputation engines, accounting for structural zeros by conceiving of the observed data as a truncated sample from a hypothetical population without structural zeros. Thi...
A Measure of Disclosure Risk for Microdata
Protection against disclosure is important for statistical agencies releasing microdata files from sample surveys. Estimates of simple measures of disclosure risk can provide useful evidence to support decisions about release. We propose a new measure of disclosure risk: the probability that a unique match between a microdata record and a population unit is correct. We argue that this measure h...
SYNTHETIC DATA FOR SMALL AREA ESTIMATION IN THE AMERICAN COMMUNITY SURVEY
Small area estimates provide a critical source of information used to study local populations. Statistical agencies regularly collect data from small areas but are prevented from releasing detailed geographical identifiers in public-use data sets due to disclosure concerns. Alternative data dissemination methods used in practice include releasing summary/aggregate tables, suppressing detailed g...
Significance tests for multi-component estimands from multiply imputed, synthetic microdata
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units or...
When Excessive Perturbation Goes Wrong and Why IPUMS-International Relies Instead on Sampling, Suppression, Swapping, and Other Minimally Harmful Methods to Protect Privacy of Census Microdata
IPUMS-International disseminates population census microdata at no cost for 69 countries. Currently, a series of 212 samples totaling almost a half billion person records are available to researchers. Registration is required for researchers to gain access to the microdata. Statistics from Google Analytics show that IPUMS-International's lengthy, probing registration form is an effective deterr...