Scalable Reference Genome Assembly from Compressed Pan-Genome Index with Spark

Altti Ilari Maarala*, Ossi Arasalo, Daniel Valenzuela, Keijo Heljanko, Veli Mäkinen

*Corresponding author for this work

Research output: Article in book/conference proceedings, conference article in proceedings, scientific, peer-reviewed

2 Citations (Scopus)

Abstract

High-throughput sequencing (HTS) technologies have enabled rapid sequencing of genomes and large-scale genome analytics with massive data sets. Traditionally, genetic variation analyses have been based on the human reference genome assembled from a relatively small human population. However, genetic variation could be discovered more comprehensively by using a collection of genomes, i.e., a pan-genome, as a reference. Pan-genomic references can be assembled from larger populations or from a specific population under study. Moreover, exploiting pan-genomic references with current bioinformatics tools requires efficient compression and indexing methods. To leverage the accumulating genomic data, the power of distributed and parallel computing has to be harnessed for new genome analysis pipelines. We propose a scalable distributed pipeline, PanGenSpark, for compressing and indexing pan-genomes and assembling a reference genome from the pan-genomic index. We experimentally show the scalability of PanGenSpark with human pan-genomes in a distributed Spark cluster comprising 448 cores across 26 computing nodes. Assembling a consensus genome from a pan-genome of 50 human individuals took 215 min, and from a pan-genome of 500 human individuals 1468 min. The index of a 1.41 TB pan-genome was compressed to 164.5 GB in our experiments.

Original language: English
Title of host publication: Big Data – BigData 2020 - 9th International Conference, Held as Part of the Services Conference Federation, SCF 2020, Proceedings
Editors: Surya Nepal, Wenqi Cao, Aziz Nasridinov, MD Zakirul Alam Bhuiyan, Xuan Guo, Liang-Jie Zhang
Publisher: Springer
Pages: 68-84
Number of pages: 17
ISBN (print): 9783030596118
DOI - permanent links
Status: Published - 1 Jan 2020
MoE publication type: A4 Article in conference proceedings
Event: IEEE International Conference on Big Data - Honolulu, United States
Duration: 18 Sep 2020 to 20 Sep 2020
Conference number: 9

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer
Volume: 12402 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: IEEE International Conference on Big Data
Abbreviated title: BigData
Country/Territory: United States
City: Honolulu
Period: 18/09/2020 to 20/09/2020
