MassSpecGym: A benchmark for the discovery and identification of molecules

Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart, Li Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J.J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal

Research output: Chapter in Book/Report/Conference proceeding > Conference article in proceedings > Scientific > peer-review

1 Citation (Scopus)

Abstract

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.

Original language: English
Title of host publication: Advances in Neural Information Processing Systems 37 (NeurIPS 2024)
Publisher: Curran Associates Inc.
Pages: 1-18
Number of pages: 18
ISBN (Print): 9798331314385
Publication status: Published - 2025
MoE publication type: A4 Conference publication
Event: Conference on Neural Information Processing Systems, Vancouver, Canada
Duration: 10 Dec 2024 to 15 Dec 2024
Conference number: 38
https://neurips.cc/Conferences/2024

Publication series

Name: Advances in Neural Information Processing Systems
Publisher: Neural Information Processing Systems Foundation
ISSN (Print): 1049-5258

Conference

Conference: Conference on Neural Information Processing Systems
Abbreviated title: NeurIPS
Country/Territory: Canada
City: Vancouver
Period: 10/12/2024 to 15/12/2024
