Machine learning methods for structural elucidation in untargeted metabolomics

Eric Bach

Research output: Thesis, Doctoral Thesis, Collection of Articles


The structural elucidation of small molecules remains a bottleneck in untargeted metabolomics and hence a limitation in many research fields, such as drug discovery, biotechnology, and environmental science. The chemical space of small molecules is vast and highly complex, making structural elucidation a challenging task. Liquid chromatography (LC) coupled with tandem mass spectrometry (MS²) is one of the leading analysis platforms in untargeted metabolomics. This platform, called LC-MS², allows for high-throughput analysis and can detect thousands of molecules simultaneously. However, only a small fraction of the detected molecules can be elucidated using reference databases. For the remaining "dark matter", automated computational tools that use large structure databases for sample annotation are indispensable. This thesis introduces different machine learning frameworks for predicting molecular structure annotations from LC-MS² data. Publication I presents a novel kernel-based method for molecular structure prediction given an MS² spectrum. It integrates structure databases into the model training instead of using them only in the prediction phase. This is achieved by so-called Magnitude-Preserving Input Output Kernel Regression, which can significantly improve the structure annotation accuracy compared to state-of-the-art methods. LC retention times (RTs) are a valuable information source and readily available in LC-MS². However, RTs remain underutilized in automated structure annotation tools. One reason for this is that RTs are LC-specific and hence generally not directly transferable between analysis platforms. Publication II introduces a novel framework for retention order (RO) prediction using a Ranking Support Vector Machine. Retention orders are better preserved across LC methods than retention times. We demonstrate that our model, integrating multiple RT datasets, predicts ROs with high accuracy.
Publication III presents a Markov Random Field model integrating RO and MS² information for structure annotation. It jointly annotates the molecules in an LC-MS² dataset, thereby exploiting pairwise RO dependencies between the molecules. We demonstrate that the integration of ROs can significantly improve the structure annotations. Publication IV introduces a framework for the joint prediction of structure annotations using a Structured Support Vector Machine model called LC-MS²Struct. The novel LC-MS²Struct model is trained on full LC-MS² datasets with ground-truth annotations and learns to optimally combine the RO and MS² information. LC-MS²Struct outperforms alternative approaches by a large margin and annotates stereoisomers with high accuracy. The methods presented in this thesis are of significance for the metabolomics community, as they improve the structure annotations in LC-MS² analyses and demonstrate how LC RTs can be integrated into automated workflows.
Translated title of the contribution: Machine learning methods for structural elucidation in untargeted metabolomics
Original language: English
Qualification: Doctor's degree
Awarding Institution
  • Aalto University
Supervisors
  • Rousu, Juho, Supervising Professor
Print ISBNs: 978-952-64-1040-1
Electronic ISBNs: 978-952-64-1041-8
Publication status: Published - 2022
MoE publication type: G5 Doctoral dissertation (article)


Keywords
  • machine learning
  • computational metabolomics
  • kernel methods

