Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics
نویسندگان
چکیده
We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
منابع مشابه
Using CART to Generate Partially Synthetic, Public Use Microdata
To limit disclosure risks, one approach is to release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple imputations. This article presents and evaluates the use of classification and regression trees to generate p...
متن کاملWorking Paper M08/07 Methodology Estimating Risks Of Identification Disclosure In Partially Synthetic Data
To limit disclosures, statistical agencies and other data disseminators can release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple draws from statistical models. Because the original records are on the file, th...
متن کاملUsing Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results
The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at eve...
متن کاملEstimating Risks of Identification Disclosure in Partially Synthetic Data
To limit disclosures, statistical agencies and other data disseminators can release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple draws from statistical models. Because the original records are on the file, th...
متن کاملSigni cance tests for multi-component estimands from multiply imputed, synthetic microdata
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units or...
متن کامل