Collaborative learning from distributed data with differentially private synthetic data

Lukas Prediger*, Joonas Jälkö, Antti Honkela, Samuel Kaski

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Background
Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population-level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and the parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy-preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.

Methods
We perform an empirical evaluation based on an existing prospective cohort study from the literature. We simulate multiple parties by splitting the UK Biobank cohort along assessment centers and generate synthetic data for each party using differentially private generative modelling techniques. We then apply the original study’s Poisson regression analysis to the combined synthetic data sets and evaluate the effects of 1) the size of the local data set, 2) the number of participating parties, and 3) local shifts in distributions on the obtained likelihood scores.
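
The pipeline described above can be illustrated with a minimal sketch. It is not the study's implementation: the differentially private generative model is replaced here by a much simpler stand-in, a histogram perturbed with Laplace noise, and the UK Biobank cohort is replaced by simulated records with one binary covariate and one count outcome. The function names `dp_synthetic_counts` and `fit_poisson`, the privacy parameter `epsilon=1.0`, the outcome cap `y_max`, and the coefficients in `true_beta` are all assumptions made for illustration. The noise scale `1/epsilon` assumes neighbouring data sets differ by adding or removing one record.

```python
import numpy as np

def dp_synthetic_counts(x, y, y_max, epsilon, rng):
    """Stand-in DP synthesizer: Laplace-noised histogram over (x, y) cells.

    x is binary, y is a count clipped to 0..y_max so that the histogram
    domain is fixed in advance and does not depend on the data.
    """
    y = np.minimum(y, y_max)
    hist = np.zeros((2, y_max + 1))
    np.add.at(hist, (x, y), 1.0)
    # Adding or removing one record changes one cell by 1 (L1 sensitivity 1).
    noisy = hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)
    cells = np.rint(np.clip(noisy, 0.0, None)).astype(int)
    # Expand the noisy cell counts back into individual synthetic records.
    grid = np.indices(cells.shape).reshape(2, -1)
    reps = cells.ravel()
    return np.repeat(grid[0], reps), np.repeat(grid[1], reps)

def fit_poisson(x, y, n_iter=25):
    """Poisson regression with log link, fitted by Newton's method."""
    X = np.column_stack([np.ones(len(x)), x]).astype(float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)            # gradient of the log-likelihood
        hess = X.T @ (X * mu[:, None])   # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
true_beta = np.array([0.0, 0.5])  # hypothetical intercept and effect size

# Each simulated party synthesizes its own data and shares only that.
shared = []
for _ in range(5):
    x = rng.integers(0, 2, size=2000)
    y = rng.poisson(np.exp(true_beta[0] + true_beta[1] * x))
    shared.append(dp_synthetic_counts(x, y, y_max=10, epsilon=1.0, rng=rng))

# Any party can now run the regression on the combined synthetic data.
syn_x = np.concatenate([s[0] for s in shared])
syn_y = np.concatenate([s[1] for s in shared])
beta_hat = fit_poisson(syn_x, syn_y)
print("estimated coefficients:", beta_hat)
```

A noisy histogram only scales to a handful of discrete variables, which is one reason a study on real cohort data would rely on generative models instead. The sketch is meant only to show the flow of data: local synthesis under differential privacy, sharing, pooling, and a downstream regression.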

Results
We find that parties engaging in collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters than when using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become, up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.

Conclusions
Based on our results, we conclude that sharing synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints, even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.
Original language: English
Article number: 167
Pages (from-to): 1-14
Number of pages: 14
Journal: BMC Medical Informatics and Decision Making
Volume: 24
Issue number: 1
Publication status: Published - 14 Jun 2024
MoE publication type: A1 Journal article-refereed

Keywords

  • collaborative learning
  • differential privacy
  • health informatics
  • synthetic data
