Publishing Naive Bayesian Classifiers: Privacy without Accuracy Loss
Authors
Abstract
We address the problem of publishing a Naïve Bayesian Classifier (NBC) or, equivalently, publishing the views necessary for building an NBC, while protecting the privacy of the individuals who provided the training data. Our approach completely preserves the accuracy of the original classifier, and thus significantly improves on current approaches, such as randomization or anonymization, which typically degrade accuracy to preserve privacy. Current query-view security checkers address the question 'Is the view safe to publish?' and are computationally expensive (often $\Pi_2^p$-complete). Here, instead, we tackle the question 'How to make a view safe to publish?' and propose a linear-time algorithm to publish safe NBC-enabling views. We first show that a simple measure that restricts the ratios between the published NBC statistics is sufficient to prevent any breach of privacy. Then, we propose a linear-time algorithm to enforce this measure by producing perturbed statistics that assure both (i) individuals' privacy, and (ii) a classifier that behaves in the same way as the NBC trained on the original data. By carefully expressing the derived statistics using rational numbers, we can easily produce synthetic (sanitized) datasets. Thus, for any given dataset, we produce another dataset that is secure to publish (w.r.t. a uniform prior) and achieves the same classification accuracy. Finally, we extend our results by providing sufficient conditions to cope with arbitrary (non-uniform prior) distributions, and we validate their effectiveness in practice through experiments on real-world data.
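As a rough illustration of what "the necessary views for building an NBC" might look like, the sketch below derives class counts and attribute-value-per-class counts from a toy dataset and classifies with exact rational arithmetic. It is not the paper's perturbation algorithm: the function names `build_views` and `classify`, the toy rows, and the absence of smoothing are all assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction


def build_views(rows):
    """Return the count statistics a naive Bayes classifier needs.

    class_counts[c]        = number of training rows with class c
    attr_counts[(i, v, c)] = number of rows with class c whose i-th attribute is v
    """
    class_counts = Counter(label for _, label in rows)
    attr_counts = Counter(
        (i, value, label)
        for features, label in rows
        for i, value in enumerate(features)
    )
    return class_counts, attr_counts


def classify(class_counts, attr_counts, features):
    """Pick the class maximizing P(c) * prod_i P(a_i | c), with exact rationals.

    No smoothing is applied (an assumption of this sketch), so an unseen
    attribute value gives a class a score of zero.
    """
    total = sum(class_counts.values())
    best_label, best_score = None, Fraction(-1)
    for label in sorted(class_counts):
        n_c = class_counts[label]
        score = Fraction(n_c, total)
        for i, value in enumerate(features):
            score *= Fraction(attr_counts[(i, value, label)], n_c)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical toy data: (attribute tuple, class label).
rows = [
    (("young", "high"), "yes"),
    (("young", "low"), "no"),
    (("old", "high"), "yes"),
    (("old", "low"), "yes"),
    (("young", "low"), "no"),
]
class_counts, attr_counts = build_views(rows)
label, score = classify(class_counts, attr_counts, ("young", "low"))
print(label, score)
```

Because the classifier only compares scores across classes, any replacement statistics that leave every such comparison unchanged would yield the same predictions; keeping the values as `Fraction` objects mirrors the abstract's remark about expressing the derived statistics as rational numbers.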
Similar resources
A New Hybrid Framework for Filter based Feature Selection using Information Gain and Symmetric Uncertainty (TECHNICAL NOTE)
Feature selection is a pre-processing technique used for eliminating the irrelevant and redundant features which results in enhancing the performance of the classifiers. When a dataset contains more irrelevant and redundant features, it fails to increase the accuracy and also reduces the performance of the classifiers. To avoid them, this paper presents a new hybrid feature selection method usi...
Supervised Classification with Gaussian Networks. Filter and Wrapper Approaches
Bayesian network-based classifiers are only able to handle discrete variables. They assume that variables are sampled from a multinomial distribution, yet most real-world domains involve continuous variables. A common practice to deal with continuous variables is to discretize them, with a subsequent loss of information. The continuous classifiers presented in this paper are supported by the Ga...
Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naive Bayes
Most of the Bayesian network-based classifiers are usually only able to handle discrete variables. However, most real-world domains involve continuous variables. A common practice to deal with continuous variables is to discretize them, with a subsequent loss of information. This work shows how discrete classifier induction algorithms can be adapted to the conditional Gaussian network paradigm ...
Induction of Selective Bayesian Classifiers
In this paper, we examine previous work on the naive Bayesian classifier and review its limitations, which include a sensitivity to correlated features. We respond to this problem by embedding the naive Bayesian induction scheme within an algorithm that carries out a greedy search through the space of features. We hypothesize that this approach will improve asymptotic accuracy in domains that i...
A New Privacy-Preserving Data Publishing Method for Improving Classification Accuracy on Anonymized Data
Data collection and storage have been facilitated by the growth in electronic services, which has led to the recording of vast amounts of personal information in the databases of public and private organizations. These records often include sensitive personal information (such as income and diseases) and must be protected from access by others. But in some cases, mining the data and extraction of knowledge from thes...
Journal title:
- PVLDB
Volume 2, Issue
Pages -
Publication date: 2009