Impacts of data synthesis: a metric for quantifiable data standards and performances |
|
Author: | Chandra, Gunjan1; Siirtola, Pekka1; Tamminen, Satu1; |
Organizations: |
1Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland
2Pediatric Research Center, Children’s Hospital, University of Helsinki and Helsinki University Hospital, Yliopistonkatu 4, 00100 Helsinki, Finland
3Research Program for Clinical and Molecular Metabolism, Faculty of Medicine, University of Helsinki, Yliopistonkatu 3, 00014 Helsinki, Finland
4Department of Paediatrics, University of Oulu, Oulu University Hospital, Kajaanintie 50, 90220 Oulu, Finland
|
Format: | article |
Version: | published version |
Access: | open |
Online Access: | PDF Full Text (PDF, 1.3 MB) |
Persistent link: | http://urn.fi/urn:nbn:fi-fe2023061656086 |
Language: | English |
Published: | Multidisciplinary Digital Publishing Institute, 2022 |
Publish Date: | 2023-06-16 |
Description: |
Abstract: Clinical data analysis could lead to breakthroughs. However, clinical data contain sensitive information about participants that could be exploited for unethical activities such as blackmail, identity theft, mass surveillance, or social engineering. Data anonymization is a standard step during data collection, before sharing, to reduce the risk of disclosure. However, conventional data anonymization techniques are not foolproof, and they also limit the opportunity for personalized evaluations. Much research has been done on synthetic data generation using generative adversarial networks and many other machine learning methods; however, these methods are either not free to use or are limited in capacity. This study evaluates the performance of an emerging tool named synthpop, an R package that produces synthetic data as an alternative approach to data anonymization. The paper establishes data standards derived from the original data set, based on the utilities and quality of information, and measures variations in the synthetic data set to evaluate the performance of the data synthesis process. The methods for assessing the utility of the synthetic data set can be broadly divided into two approaches: general utility and specific utility. General utility assesses whether the synthetic data are similar overall to the original data set in their statistical properties and multivariate relationships. Specific utility assesses how closely a fitted model’s performance on the synthetic data matches its performance on the original data. The quality of information is assessed by comparing the entropy (in bits) and the mutual information with the response variables between the original and synthetic data sets.
The study reveals that the synthetic data passed all utility tests, with no statistically significant difference from the original data, and preserved not only the utilities but also the complexity of the original data set according to the data standard established in this study. The authors therefore conclude that synthpop meets these requirements and opens a wide range of opportunities for the research community, including easy data sharing and information protection.
|
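The information-quality comparison described in the abstract (entropy in bits and mutual information with a response variable, measured on the original and on the synthetic data) can be illustrated with a minimal sketch. This is not the authors' code: the study used the R package synthpop, whereas the sketch below is plain Python, and the function names and the toy data are hypothetical and chosen only for illustration. It assumes discrete (categorical) variables.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information_bits(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    joint = list(zip(x, y))
    return entropy_bits(x) + entropy_bits(y) - entropy_bits(joint)


# Hypothetical toy data: one categorical feature and a binary response,
# once for an "original" data set and once for a "synthetic" one.
orig_feature = ["a", "a", "b", "b"]
orig_response = [0, 0, 1, 1]
syn_feature = ["a", "a", "b", "b"]
syn_response = [0, 0, 1, 0]

# Difference in entropy and in mutual information between the two data sets;
# small differences suggest the synthetic data kept the information content.
entropy_diff = entropy_bits(orig_response) - entropy_bits(syn_response)
mi_diff = (mutual_information_bits(orig_feature, orig_response)
           - mutual_information_bits(syn_feature, syn_response))

print(f"entropy difference (bits): {entropy_diff:.3f}")
print(f"mutual information difference (bits): {mi_diff:.3f}")
```

For continuous variables a binning or estimator step would be needed before applying these functions; that choice is left open here because the record does not state how the study handled it.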
Series: |
Data |
ISSN: | 2306-5729 |
ISSN-E: | 2306-5729 |
ISSN-L: | 2306-5729 |
Volume: | 7 |
Issue: | 12 |
Article number: | 178 |
DOI: | 10.3390/data7120178 |
OADOI: | https://oadoi.org/10.3390/data7120178 |
Type of Publication: |
A1 Journal article – refereed |
Field of Science: |
113 Computer and information sciences |
Funding: |
This study is funded by the HTx project, which has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 825162. HTx is a Horizon 2020 project supported by the European Union, lasting five years from January 2019. |
EU Grant Number: |
(825162) HTx - Next Generation Health Technology Assessment to support patient-centred, societally oriented, real-time decision-making on access and reimbursement for health technologies throughout Europe |
Dataset Reference: |
More information on DIPP data and its owners can be found on the DIPP data website: http://dipp.fi (accessed on 4 December 2022). WDBC data supporting reported results can be found in [9,56] and links to the data set analyzed: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) (accessed on 4 December 2022). |
Copyright information: |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |