
Data Privacy

Privacy Preserving Data Mining (PPDM)

Statistical Disclosure Control (SDC)


PPDM / SDC / Inference Control: Privacy Preserving Data Mining (PPDM), Statistical Disclosure Control (SDC) and Inference Control are disciplines whose goal is to allow dissemination/transfer of respondent data while preserving respondent privacy. To that end, techniques have been defined that transform an original dataset into a protected dataset such that:
i) analyses on the original and protected datasets yield similar results (data utility);
ii) information in the protected dataset is unlikely to be linkable to the particular respondent it originated from (data safety).
Protection procedures are usually classified into two main families: cryptographic and perturbative. I prefer to classify them into data-driven (or general-purpose), computation-driven (or specific-purpose), and result-driven protection procedures.
  • Data-driven or general purpose: when the intended use of the data to be published is not known. E.g., some users might apply regression, others might compute means, and AI-related people might apply classification or association rules. Perturbative methods are appropriate for this purpose.
  • Computation-driven or specific purpose: when the type of analysis to be performed on the data is known (e.g., association rules). In this case, protection can be done so that the results on the protected data are the same as on the original data. Nevertheless, the best approach here is for the data owner and the data analyser to agree on a cryptographic protocol so that the analysis can be done with no information loss. The case of distributed data with a specific goal also falls into this class.
  • Result-driven: when the privacy concern relates to the result of applying a particular data mining method to some particular data. Protection methods have been designed so that, e.g., the association rules obtained from a data set do not disclose sensitive information about a particular individual.
We are working on data-driven protection methods. These methods can be classified into three categories, according to their manipulation of the original data:
  • Perturbative: Data is distorted in a way that causes the protected data set to contain some errors. The simplest approach is to add noise (additive noise). Other methods exist, e.g., microaggregation, rank swapping, multiplicative noise, and PRAM.
  • Non perturbative: Data is modified, but no errors are introduced into the protected data set. Protection is achieved by replacing values with less specific ones (e.g., a number is replaced by an interval). In short, non perturbative methods reduce the level of detail of the dataset.
  • Synthetic data generators: Data is not distorted; instead, new data is created and used to replace the original data. Some claim that synthetic data avoids disclosure risk, but this does not hold if the synthetic data is of sufficiently high quality. See our paper at PSD 2006.
More details in:
Torra, V. (2010) Privacy in Data Mining, in O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer 687-716.
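
To make the difference between perturbative and non perturbative masking concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function names (add_noise, generalize), the Gaussian noise model, the interval width, and the toy ages are all hypothetical choices made for this example, not the methods evaluated in the references above.

```python
import random


def add_noise(values, scale, seed=0):
    """Perturbative masking (assumed variant): add zero-mean Gaussian noise.

    The masked values contain errors with respect to the originals.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [v + rng.gauss(0.0, scale) for v in values]


def generalize(values, width):
    """Non perturbative masking: replace each value by the interval [lo, hi)
    of the given width that contains it.

    The result is less detailed than the original but not erroneous.
    """
    masked = []
    for v in values:
        lo = (v // width) * width
        masked.append((lo, lo + width))
    return masked


# Hypothetical toy attribute: ages of five respondents.
ages = [23, 37, 41, 58, 64]

noisy_ages = add_noise(ages, scale=2.0)   # distorted values
age_ranges = generalize(ages, width=10)   # intervals containing the true values

print(noisy_ages)
print(age_ranges)
```

In this sketch the noisy ages are wrong by a small random amount, whereas each interval still truthfully contains the original age; that is the sense in which non perturbative methods reduce detail without introducing errors.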
Key issues: The following elements can be distinguished as key issues for data protection:
- Masking methods: methods to manipulate data so that it is protected. They can be classified as perturbative, non-perturbative and synthetic methods.
- Information loss measures: they evaluate to what extent the protected data is useful for researchers and decision makers.
- Disclosure risk measures: they evaluate to what extent confidentiality is ensured.
- Transparency principle: protected/masked data has to be published together with information on how the data has been protected, that is, the masking method applied as well as its parameterization.
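
The trade-off between information loss and disclosure risk can be sketched with two deliberately simple measures. Both are assumptions made for illustration: information loss is taken here as the mean absolute difference between original and masked values, and disclosure risk as the fraction of masked records that a nearest-value record linkage attack re-identifies correctly. The function names and the toy income data are hypothetical.

```python
import random


def information_loss(original, masked):
    """Assumed measure: mean absolute difference between original and masked
    values. Lower values mean the masked data is closer to the original."""
    diffs = [abs(o - m) for o, m in zip(original, masked)]
    return sum(diffs) / len(diffs)


def linkage_risk(original, masked):
    """Assumed measure: fraction of masked records whose nearest original
    value belongs to the record they were derived from (a simple
    distance-based record linkage on one numerical attribute)."""
    hits = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - m))
        if nearest == i:
            hits += 1
    return hits / len(masked)


# Hypothetical toy attribute: incomes of five respondents.
incomes = [21000, 34000, 35500, 52000, 78000]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for scale in (100, 5000):
    # Additive Gaussian noise with increasing standard deviation.
    masked = [x + rng.gauss(0, scale) for x in incomes]
    print(scale, information_loss(incomes, masked), linkage_risk(incomes, masked))
```

Larger noise typically raises the information loss while making linkage harder, which is why both kinds of measure are needed to compare masking methods and their parameterizations.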

Data privacy users: E-commerce companies (to exchange customer information) and statistical agencies (to provide society with accurate statistical information). They need to take advantage of the huge amount of information they collect while at the same time preserving individuals' privacy.

Cite this site as:
V. Torra, Data privacy, Springer, 2017 (forthcoming). Associated website: http://www.ppdm.cat/dp/

Vicenç Torra. Last modified: 15:34, December 11, 2014.