An Approach to Reducing Information Loss and Achieving Diversity of Sensitive Attributes in k-anonymity Methods
Authors
Abstract
Electronic Health Records (EHRs) enable the sharing of patients' medical data. Because EHRs include patients' private data, access by researchers is restricted. An anonymization approach such as k-anonymity is therefore needed to keep patients' private data safe without damaging useful medical information. However, k-anonymity cannot prevent sensitive attribute disclosure. An alternative, l-diversity, has been proposed as a solution to this problem and is defined as follows: each Q-block (i.e., each set of rows corresponding to the same value for identifiers) contains at least l well-represented values for each sensitive attribute. While l-diversity protects against sensitive attribute disclosure, it is limited in that it focuses only on diversifying sensitive attributes. The aim of this study is to develop a k-anonymity method that not only minimizes information loss but also achieves diversity of the sensitive attribute. This paper proposes a new privacy protection method that uses conditional entropy and mutual information, and that considers both information loss and diversity of sensitive attributes. Conditional entropy measures the information lost through generalization, and mutual information is used to achieve diversity of sensitive attributes. The method can therefore offer appropriate Q-blocks for generalization. We used the Adult database from the UCI Machine Learning Repository and found that the proposed method can greatly reduce information loss compared with a recent l-diversity study. It can also achieve diversity of sensitive attributes, as assessed by counting the number of Q-blocks that have leaks of diversity. This study provides a privacy protection method that can improve data utility and protect against sensitive attribute disclosure. The method is viable and should be of interest for further privacy protection in EHR applications.
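The quantities named in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm, which is not specified here: the toy table, the column names (`age`, `age_band`, `disease`), the helper function names, and the use of distinct l-diversity (counting distinct sensitive values per Q-block) are all assumptions made for illustration. It only shows how Q-blocks, conditional entropy as a measure of generalization loss, and mutual information between a generalized attribute and a sensitive attribute might be computed.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical toy table: "age" is the original value, "age_band" its
# generalization, and "disease" the sensitive attribute.
rows = [
    {"age": 23, "age_band": "20-29", "disease": "flu"},
    {"age": 27, "age_band": "20-29", "disease": "flu"},
    {"age": 31, "age_band": "30-39", "disease": "flu"},
    {"age": 38, "age_band": "30-39", "disease": "cold"},
]

def q_blocks(table, id_cols):
    """Group rows that share the same values on the identifier columns."""
    blocks = defaultdict(list)
    for row in table:
        blocks[tuple(row[c] for c in id_cols)].append(row)
    return dict(blocks)

def is_k_anonymous(blocks, k):
    """Every Q-block must contain at least k rows."""
    return all(len(block) >= k for block in blocks.values())

def blocks_lacking_diversity(blocks, sensitive, l):
    """Keys of Q-blocks with fewer than l distinct sensitive values
    (distinct l-diversity, an assumed simplification)."""
    return [key for key, block in blocks.items()
            if len({r[sensitive] for r in block}) < l]

def entropy(values):
    """Shannon entropy H(X) in bits of the empirical distribution."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """H(X | Y) = sum over y of p(y) * H(X | Y = y)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

blocks = q_blocks(rows, ["age_band"])
ages = [r["age"] for r in rows]
bands = [r["age_band"] for r in rows]
diseases = [r["disease"] for r in rows]

print(is_k_anonymous(blocks, 2))
print(blocks_lacking_diversity(blocks, "disease", 2))
print(conditional_entropy(ages, bands))       # uncertainty about age left after generalization
print(mutual_information(diseases, bands))    # how much the band reveals about disease
```

In this toy table both Q-blocks satisfy 2-anonymity, but the `20-29` block contains only `flu`, so it is flagged as lacking diversity. The conditional entropy of the original age given its band is 1 bit, since each band hides two equally likely ages.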
Similar resources
A Privacy Protection Model for Patient Data with Multiple Sensitive Attributes
The identity of patients must be protected when patient data are shared. The two most commonly used models to protect identity of patients are L-diversity and K-anonymity. However, existing work mainly considers data sets with a single sensitive attribute, while patient data often contain multiple sensitive attributes (e.g., diagnosis and treatment). This article shows that although the K-anony...
Anatomisation with slicing: a new privacy preservation approach for multiple sensitive attributes
An enormous quantity of personal health information has become available in recent decades, and tampering with any part of this information poses a great risk to the health care field. Existing anonymization methods, such as generalization and bucketization, are suited only to data with a single sensitive attribute and low dimensionality. In this paper, an anonymization technique is proposed t...
Resolving the Complexity of Some Data Privacy Problems
We formally study two methods for data sanitation that have been used extensively in the database community: k-anonymity and l-diversity. We settle several open problems concerning the difficulty of applying these methods optimally, proving both positive and negative results: – 2-anonymity is in P. – The problem of partitioning the edges of a triangle-free graph into 4-stars (degree-three verti...
A Clustering Approach for Achieving Data Privacy
New privacy regulations together with ever-increasing data availability and computational power have created a huge interest in data privacy research. One major research direction is built around the k-anonymity property and its extensions, which are required for the released data. In this paper we present such an extension to k-anonymity, called p-sensitive k-anonymity, which solves some of the weak...
P-Sensitive K-Anonymity with Generalization Constraints
Numerous privacy models based on the k-anonymity property and extending the k-anonymity model have been introduced in the last few years in data privacy research: l-diversity, p-sensitive k-anonymity, (α, k)-anonymity, t-closeness, etc. While differing in their methods and quality of their results, they all focus first on masking the data, and then protecting the quality of the data as a wh...