TY - JOUR
T1 - HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes
AU - Wharrie, Sophie
AU - Yang, Zhiyu
AU - Raj, Vishnu
AU - Monti, Remo
AU - Gupta, Rahul
AU - Wang, Ying
AU - Martin, Alicia R.
AU - O'Connor, Luke
AU - Kaski, Samuel
AU - Marttinen, Pekka
AU - Palamara, Pier Francesco
AU - Lippert, Christoph
AU - Ganna, Andrea
AU - INTERVENE Consortium
PY - 2022/12/22
Y1 - 2022/12/22
AB - Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST is computationally faster and shows a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven methods for generating polygenic risk scores across multiple ancestry groups and different genetic architectures.
UR - http://dx.doi.org/10.1101/2022.12.22.521552
UR - https://github.com/intervene-EU-H2020/synthetic_data
U2 - 10.1101/2022.12.22.521552
DO - 10.1101/2022.12.22.521552
M3 - Article
JO - bioRxiv
JF - bioRxiv
ER -