An interpretable molecular descriptor for machine learning predictions in atmospheric science

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

The study of aerosol formation and chemistry using machine learning is limited by the lack of molecular descriptors suited to atmospheric compounds. Interpretable models are particularly affected because they often rely on dictionary-based descriptors tied to specific molecular substructures, which currently fail to capture the full range of organic atmospheric compounds, including large, highly oxidized molecules common in the atmosphere. We introduce ATMOMACCS, an interpretable descriptor combining the 166 binary keys of the MACCS fingerprint with motifs inspired by the SIMPOL method for estimating saturation vapor pressures. We show that ATMOMACCS outperforms the RDKit topological fingerprint in kernel ridge regression models, improving predictions of saturation vapor pressures (7%, 8%, 29%, and 43% error reduction), equilibrium partition coefficients (5% and 9% error reduction), glass transition temperatures (22% error reduction), and enthalpies of vaporization (61% error reduction) on six datasets with atmospheric compounds. Feature analysis shows that saturation vapor pressure and partition coefficients are governed by carbon number and oxygen-related features, whereas other phase-transition properties (e.g., enthalpy of vaporization and glass transition temperature) depend on carbon–hydrogen bond types and the presence of heteroatoms other than oxygen. This highlights the generalizability of ATMOMACCS across different datasets and properties as an interpretable molecular descriptor.
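The modelling setup described in the abstract, a descriptor that concatenates binary substructure keys with additional motif features and is fed to kernel ridge regression, can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian kernel, the hyperparameters `lam` and `gamma`, the three toy binary keys (standing in for the 166 MACCS keys), the two motif counts, and the target values are all assumptions made purely for illustration. In practice the binary keys would come from a cheminformatics toolkit rather than being typed by hand.

```python
import math

def rbf_kernel(a, b, gamma=0.1):
    """Gaussian kernel between two descriptor vectors (kernel choice is an assumption)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def krr_fit(X, y, lam=1e-3, gamma=0.1):
    """Kernel ridge regression weights: alpha = (K + lam * I)^-1 y."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def krr_predict(X_train, alpha, x, gamma=0.1):
    """Prediction f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train))

# Hypothetical toy data: 3 binary keys plus 2 motif counts per molecule.
binary_keys = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
motif_counts = [[4, 1], [6, 2], [8, 3], [10, 5]]  # e.g. carbon number, hydroxyl count
X = [keys + counts for keys, counts in zip(binary_keys, motif_counts)]
y = [-2.0, -4.1, -6.3, -9.0]  # made-up target values, not data from the study

alpha = krr_fit(X, y)
predictions = [krr_predict(X, alpha, x) for x in X]
```

Because every column of the concatenated descriptor corresponds to a named key or motif, a trained model of this kind can be inspected feature by feature, which is the sense in which the abstract calls the descriptor interpretable.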

Original language: English
Article number: 084115
Pages (from-to): 1-19
Number of pages: 19
Journal: Journal of Chemical Physics
Volume: 164
Issue number: 8
Publication status: Published - 28 Feb 2026
MoE publication type: A1 Journal article-refereed

Funding

This study was supported by the Research Council of Finland through Project No. 346377, the EU COST Actions Grant Nos. CA18234 and CA22154, and the European Commission through the Marie Skłodowska-Curie Actions (MSCA) under Grant Agreement No. 101203938. We further acknowledge CSC-IT Center for Science, Finland, and the Aalto Science-IT project. The authors acknowledge Theo Kurtén for insightful discussions.

Facility/equipment: Science-IT, School of Science (Manager: Hakala, M.)
