Anonymisation of data by synthesising data
The creation of synthetic datasets has been proposed as a statistical disclosure control solution, especially for generating public use files from confidential data or datasets that can be shared within an organisation or company. It is also a tool for creating "augmented datasets" that serve as input for microsimulation models. The performance and acceptability of such a tool rely heavily on the quality of the synthetic data, i.e., on the statistical similarity between the synthetic and the true population of interest. Multiple approaches and tools have been developed to generate synthetic data. These approaches can be categorised into four main groups: synthetic reconstruction, combinatorial optimisation, model-based generation, and deep learning approaches. In addition, methods have been formulated to evaluate the quality of synthetic data.
In this presentation, the methods are not presented from a theoretical point of view; rather, they are introduced in an applied and generally understandable fashion. We focus on new concepts for the model-based generation of synthetic data that avoid disclosure problems. At the end of the presentation, we introduce simPop, an open-source data synthesiser. simPop is a user-friendly R package based on a modular object-oriented concept. It provides a highly optimised S4 class implementation of various methods, including calibration by iterative proportional fitting/updating and simulated annealing, and modelling or data fusion by logistic regression, regression tree methods and many other methods. Utility functions to deal with (age) heaping are implemented as well. An example is shown using real data from official statistics. The simulated data can then serve as input for agent-based simulation and/or microsimulation, or be shared within a company or organisation, or between organisations, without conflicting with privacy and data protection laws. Synthetic data can even be used as open data for research and teaching.
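To give a feel for the calibration step mentioned above, the following is a minimal sketch of iterative proportional fitting on a two-way table, written in plain Python. It is an illustration only and not simPop's implementation or API: the function name `ipf`, its parameters, and the toy counts and target margins are all assumptions made for this example.

```python
def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Rescale a two-way table so that its margins match the given targets.

    Hypothetical helper for illustration; assumes the row and column
    targets sum to the same grand total and the seed has no empty margins.
    """
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(max_iter):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns match after the column step, so only the rows are checked.
        max_err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if max_err < tol:
            break
    return table


# Toy example (made-up numbers): sample counts by two categorical variables,
# calibrated to known population margins.
seed = [[10.0, 20.0], [30.0, 40.0]]
fitted = ipf(seed, row_targets=[40.0, 60.0], col_targets=[50.0, 50.0])
```

The fitted table reproduces the target margins while keeping the association structure of the seed table: alternating row and column scaling leaves cross-product ratios unchanged. In a synthesis setting, the seed would typically come from survey data and the targets from known population totals.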
Dr Matthias Templ, Zurich University of Applied Sciences (ZHAW School of Engineering)
Please note that this session will not be recorded